Overlap Layout Consensus assembly
Ben Langmead
You are free to use these slides. If you do, please sign the
guestbook (www.langmead-lab.org/teaching-materials), or email
me (ben.langmead@gmail.com) and tell me briefly how you’re
using them. For original Keynote files, email me.
Real-world assembly methods
OLC: Overlap-Layout-Consensus assembly
DBG: De Bruijn graph assembly
Both handle unresolvable repeats by essentially leaving them out
Unresolvable repeats break the assembly into fragments
Fragments are contigs (short for contiguous)
Example: a_long_long_long_time
Assemble substrings with Greedy-SCS: a_long_long_time
Assemble substrings with OLC or DBG: a_long, long_time
Assembly alternatives
Alternative 1: Overlap-Layout-Consensus (OLC) assembly
Stages: Overlap, Layout, Consensus
Alternative 2: de Bruijn graph (DBG) assembly
Stages: Error correction, de Bruijn graph, Scaffolding, Refine
Overlap Layout Consensus
Overlap: Build overlap graph
Layout: Bundle stretches of the overlap graph into contigs
Consensus: Pick most likely nucleotide sequence for each contig
Finding overlaps
Can we be less naive than this?
X: CTCTAGGCC
Y: TAGGCCCTC
Say l = 3
Take the length-l suffix of X (GCC). Look for this in Y, going right-to-left
Found it: GCC occurs in Y at offset 3
X: CTCTAGGCC
Y:    TAGGCCCTC
Extend to left; in this case, we confirm that a length-6 prefix of Y matches a suffix of X
We’re doing this for every pair of input strings
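The seed-and-extend idea above can be sketched in Python. This is a minimal illustration, not code from the slides: the names `longest_overlap` and `all_overlaps` are hypothetical, and the sketch reports only the longest exact suffix/prefix match of at least `min_length` characters.

```python
from itertools import permutations

def longest_overlap(x, y, min_length=3):
    """Length of the longest suffix of x that equals a prefix of y (0 if none)."""
    if min_length <= 0 or len(x) < min_length:
        return 0
    seed = x[-min_length:]           # length-l suffix of X
    end = len(y)
    while True:
        pos = y.rfind(seed, 0, end)  # scan Y right-to-left for the seed
        if pos == -1:
            return 0
        k = pos + min_length         # candidate overlap length
        if x.endswith(y[:k]):        # extend to the left and verify
            return k
        end = pos + min_length - 1   # next occurrence must start further left

def all_overlaps(reads, min_length=3):
    """Run the naive check on every ordered pair of distinct reads."""
    found = {}
    for a, b in permutations(reads, 2):
        k = longest_overlap(a, b, min_length)
        if k > 0:
            found[(a, b)] = k
    return found

example = longest_overlap("CTCTAGGCC", "TAGGCCCTC", 3)
```

Scanning right-to-left means the first verified hit is also the longest overlap, so the function can return immediately.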
Finding overlaps
Can we use suffix trees for overlapping?
Problem: Given a collection of strings S, for each string x in S find all
overlaps involving a prefix of x and a suffix of another string y
Hint: Build a generalized suffix tree of the strings in S
Finding overlaps with suffix tree
Generalized suffix tree for {“GACATA”, “ATAGAC”}, built over GACATA$0ATAGAC$1
[Figure: generalized suffix tree; each leaf is labeled with the offset (0 to 13) of its suffix in the concatenated string]
Say query = GACATA. From root, follow path labeled with query.
Green edge implies length-3 suffix of second string equals length-3 prefix of query
ATAGAC
   |||
   GACATA
Finding overlaps with suffix tree
Generalized suffix tree for {“GACATA”, “ATAGAC”}, built over GACATA$0ATAGAC$1
[Figure: generalized suffix tree; each leaf is labeled with the offset (0 to 13) of its suffix in the concatenated string]
Strategy:
(1) Build tree
(2) For each string: Walk down from root and report any outgoing edge labeled with a separator. Each corresponds to a prefix/suffix match involving prefix of query string and suffix of string ending in the separator.
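The two-step strategy can be sketched as follows. To stay short, this sketch swaps in a naive uncompressed suffix trie (quadratic size) for the linear-time generalized suffix tree, so it illustrates the walk, not the O(N) construction. The names `build_suffix_trie`, `suffix_prefix_matches` and `SEP` are hypothetical, and reads are assumed never to contain the `$` character.

```python
SEP = "$"  # separator key; assumes reads never contain this character

def build_suffix_trie(reads):
    """Step (1): insert every suffix of every read, ending in a separator entry."""
    root = {}
    for idx, read in enumerate(reads):
        for start in range(len(read)):
            node = root
            for ch in read[start:]:
                node = node.setdefault(ch, {})
            # the separator records which reads have a suffix ending at this node
            node.setdefault(SEP, set()).add(idx)
    return root

def suffix_prefix_matches(reads, min_length=1):
    """Step (2): walk down from the root with each read as the query.

    Returns {(i, q): k}, meaning the length-k suffix of reads[i] equals the
    length-k prefix of reads[q]; only the longest k per pair is kept.
    """
    root = build_suffix_trie(reads)
    best = {}
    for q, read in enumerate(reads):
        node = root
        for depth, ch in enumerate(read, start=1):
            node = node[ch]  # always present: the read is a suffix of itself
            if depth < min_length:
                continue
            for other in node.get(SEP, ()):
                if other != q:
                    best[(other, q)] = depth  # deeper hits overwrite shorter ones
    return best

overlaps = suffix_prefix_matches(["GACATA", "ATAGAC"])
```

For the two example strings, both directions give a length-3 match; the length-1 match (suffix A of GACATA against prefix A of ATAGAC) is found first and then replaced by the longer one.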
Finding overlaps with suffix tree
Generalized suffix tree for {“GACATA”, “ATAGAC”}, built over GACATA$0ATAGAC$1
[Figure: generalized suffix tree; each leaf is labeled with the offset (0 to 13) of its suffix in the concatenated string]
Now let query be second string: ATAGAC
GACATA
   |||
   ATAGAC
GACATA
     |
     ATAGAC
ATAGAC
   |||
   GACATA
Finding overlaps with suffix tree
[Figure: generalized suffix tree for {“GACATA”, “ATAGAC”}, with the query paths shown in red and the separator edges that report overlaps shown in green]
Assume for given string pair we report only the longest suffix/prefix match
Say there are d reads of length n, total length N = dn, and a = # read pairs that overlap
Time to build generalized suffix tree: O(N)
... to walk down red paths: O(N)
... to find & report overlaps (green): O(a)
Overall: O(N + a)
d^2 doesn’t appear explicitly, but a is O(d^2) in worst case
Finding overlaps
What if we want to allow mismatches and gaps in the overlap?
X: CTCGGCCCTAGG
      ||| |||||
Y:    GGCTCTAGGCCC
I.e. How do we find the best alignment of a suffix of X to a prefix of Y?
Dynamic programming
But we must frame the problem such that only backtraces
involving a suffix of X and a prefix of Y are allowed
Finding overlaps with dynamic programming
X: CTCGGCCCTAGG
      ||| |||||
Y:    GGCTCTAGGCCC
We’ll use global alignment recurrence and score function
Find the best alignment of a suffix of X to a
prefix of Y
s(a, b):
      A   C   G   T   -
  A   0   4   2   4   8
  C   4   0   4   2   8
  G   2   4   0   4   8
  T   4   2   4   0   8
  -   8   8   8   8

$$
D[i, j] = \min
\begin{cases}
D[i-1, j] + s(x[i-1], -) \\
D[i, j-1] + s(-, y[j-1]) \\
D[i-1, j-1] + s(x[i-1], y[j-1])
\end{cases}
$$
But how do we force it to find prefix / suffix matches?
Finding overlaps with dynamic programming
Find the best alignment of a suffix of X to a prefix of Y
How to initialize first row & column so suffix of X aligns to prefix of Y?
First column gets 0s (any suffix of X is possible)
First row gets ∞s (must be a prefix of Y)
Rows are X, columns are Y:
        -   G   G   C   T   C   T   A   G   G   C   C   C
    -   0   ∞   ∞   ∞   ∞   ∞   ∞   ∞   ∞   ∞   ∞   ∞   ∞
    C   0   4  12  20  28  36  44  52  60  68  76  84  92
    T   0   4   8  14  20  28  36  44  52  60  68  76  84
    C   0   4   8   8  16  20  28  36  44  52  60  68  76
    G   0   0   4  12  12  20  24  30  36  44  52  60  68
    G   0   0   0   8  16  16  24  26  30  36  44  52  60
    C   0   4   4   0   8  16  18  26  30  34  36  44  52
    C   0   4   8   4   2   8  16  22  30  34  34  36  44
    C   0   4   8   8   6   2  10  18  26  34  34  34  36
    T   0   4   8  10   8   8   2  10  18  26  34  36  36
    A   0   2   6  12  14  12  10   2  10  18  26  34  40
    G   0   0   2  10  16  18  16  10   2  10  18  26  34
    G   0   0   0   6  14  20  22  18  10   2  10  18  26
X: CTCGGCCCTAGG
      ||| |||||
Y:    GGCTCTAGGCCC
Backtrace from last row
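The matrix fill and backtrace can be sketched in Python as below. This is an illustration under stated assumptions, not the slides' own code: the function names are hypothetical, the score function is written as a rule (0 match, 2 transition, 4 transversion, 8 gap) that reproduces the s(a, b) table, and ties in the backtrace are broken in favor of the diagonal.

```python
import math

TRANSITIONS = {frozenset("AG"), frozenset("CT")}

def s(a, b):
    """Score function from the table: 0 match, 2 transition, 4 transversion, 8 gap."""
    if a == "-" or b == "-":
        return 8
    if a == b:
        return 0
    return 2 if frozenset((a, b)) in TRANSITIONS else 4

def fill_overlap_matrix(x, y):
    """Global-alignment recurrence with suffix/prefix initialization."""
    n, m = len(x), len(y)
    # first column gets 0s (any suffix of X), rest of first row gets infinity
    D = [[0] + [math.inf] * m for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + s(x[i - 1], "-"),
                          D[i][j - 1] + s("-", y[j - 1]),
                          D[i - 1][j - 1] + s(x[i - 1], y[j - 1]))
    return D

def backtrace_start(D, x, y, j):
    """Backtrace from cell (len(x), j) to column 0; returns where the suffix of X starts."""
    i = len(x)
    while j > 0:
        if i > 0 and D[i][j] == D[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            i, j = i - 1, j - 1   # match or mismatch
        elif i > 0 and D[i][j] == D[i - 1][j] + s(x[i - 1], "-"):
            i -= 1                # character of X against a gap
        else:
            j -= 1                # character of Y against a gap
    return i

X, Y = "CTCGGCCCTAGG", "GGCTCTAGGCCC"
D = fill_overlap_matrix(X, Y)
last_row = D[len(X)]
start = backtrace_start(D, X, Y, 9)   # backtrace from the score-2 cell in column 9
```

Backtracing from column 9 of the last row recovers the alignment of the suffix `X[start:]` against the length-9 prefix of Y.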
Finding overlaps with dynamic programming
Find the best alignment of a suffix of X to a prefix of Y
Last row of the matrix: 0 0 0 6 14 20 22 18 10 2 10 18 26
Problem: very short matches got high scores by chance (the 0s at the left end of the last row), which might obscure the more relevant match (the 2 in the column for the length-9 prefix of Y)
Say we want to enforce minimum overlap length l = 5
Finding overlaps with dynamic programming
Find the best alignment of a suffix of X to a prefix of Y
Rows are X, columns are Y:
        -   G   G   C   T   C   T   A   G   G   C   C   C
    -   0   ∞   ∞   ∞   ∞   ∞   ∞   ∞   ∞   ∞   ∞   ∞   ∞
    C   0   4  12  20  28  36  44  52  60  68  76  84  92
    T   0   4   8  14  20  28  36  44  52  60  68  76  84
    C   0   4   8   8  16  20  28  36  44  52  60  68  76
    G   0   0   4  12  12  20  24  30  36  44  52  60  68
    G   0   0   0   8  16  16  24  26  30  36  44  52  60
    C   0   4   4   0   8  16  18  26  30  34  36  44  52
    C   0   4   8   4   2   8  16  22  30  34  34  36  44
    C   0   4   8   8   6   2  10  18  26  34  34  34  36
    T   ∞   4   8  10   8   8   2  10  18  26  34  36  36
    A   ∞  12   6  12  14  12  10   2  10  18  26  34  40
    G   ∞  20  12  10  16  18  16  10   2  10  18  26  34
    G   ∞   ∞   ∞   ∞   ∞  20  22  18  10   2  10  18  26
Solve by initializing certain
additional cells to ∞
Cells whose values changed
highlighted in red
Now the relevant match is the
best candidate
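A minimum overlap length can be enforced in code by blocking the same kinds of cells. The sketch below is an assumption-laden illustration: the name `best_overlap` is hypothetical, and the exact set of blocked cells is my choice (start cells leaving fewer than `min_overlap` characters of X, and last-row cells covering fewer than `min_overlap` characters of Y), which may differ slightly from the cells marked on the slide.

```python
import math

def s(a, b):
    """Score rule matching the s(a, b) table: 0 match, 2 transition, 4 transversion, 8 gap."""
    if a == "-" or b == "-":
        return 8
    if a == b:
        return 0
    return 2 if {a, b} in ({"A", "G"}, {"C", "T"}) else 4

def best_overlap(x, y, min_overlap):
    """Return (score, j): best suffix-of-x / prefix-of-y alignment using y[:j]."""
    n, m = len(x), len(y)
    if min_overlap > min(n, m):
        return math.inf, 0
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    # assumption: a start cell is allowed only if at least min_overlap characters of x remain
    for i in range(n - min_overlap + 1):
        D[i][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if i == n and j < min_overlap:
                continue  # leave too-short end cells at infinity
            D[i][j] = min(D[i - 1][j] + s(x[i - 1], "-"),
                          D[i][j - 1] + s("-", y[j - 1]),
                          D[i - 1][j - 1] + s(x[i - 1], y[j - 1]))
    best_j = min(range(min_overlap, m + 1), key=lambda j: D[n][j])
    return D[n][best_j], best_j

X, Y = "CTCGGCCCTAGG", "GGCTCTAGGCCC"
with_minimum = best_overlap(X, Y, 5)
without_minimum = best_overlap(X, Y, 1)
```

With `min_overlap=1` the chance match of a single G wins with score 0; with `min_overlap=5` the length-9 overlap with one C/T mismatch becomes the best candidate.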
Finding overlaps with dynamic programming
Say there are d reads of length n, total length N = dn, and a is total number of pairs with an overlap
Number of overlaps to try: O(d^2)
Size of each dynamic programming matrix: O(n^2)
Overall: O(d^2 n^2) = O(N^2)
Contrast O(N^2) with suffix tree: O(N + a), but where a is worst-case O(d^2)
But dynamic programming is more flexible, allowing mismatches and gaps
Real-world overlappers mix the two, using indexes to filter out vast majority of
non-overlapping pairs, then using dynamic programming for remaining pairs
Finding overlaps
Overlapping is typically the slowest part of assembly
Consider a second-generation sequencing dataset with
hundreds of millions or billions of reads!
Approaches from alignment unit can be adapted to finding overlaps
Could also have adapted efficient exact matching,
approximate string matching, co-traversal, ...
We saw adaptations of naive exact matching, suffix-tree-
assisted exact matching, and dynamic programming
Finding overlaps
Celera Assembler’s overlapper is probably the best documented:
http://wgs-assembler.sourceforge.net/wiki/index.php/RunCA#Overlapper
Inverted substring indexes built on batches of reads
Only look for overlaps between reads that share one or more substrings of some length
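The filtering idea can be illustrated with a small inverted k-mer index. This is only a sketch of the general principle, not Celera Assembler's actual implementation; `candidate_pairs` and the choice of k are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    """Pairs of read indices that share at least one length-k substring."""
    index = defaultdict(set)              # inverted index: k-mer -> read indices
    for i, read in enumerate(reads):
        for p in range(len(read) - k + 1):
            index[read[p:p + k]].add(i)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

reads = ["CTCTAGGCC", "TAGGCCCTC", "AAAAAAAAA"]
pairs = candidate_pairs(reads, 4)
```

Only the surviving pairs would then be passed to a slower, more flexible check such as dynamic programming.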
Overlap Layout Consensus
Overlap: Build overlap graph
Layout: Bundle stretches of the overlap graph into contigs
Consensus: Pick most likely nucleotide sequence for each contig
Layout
Overlap graph is big and messy. Contigs don’t “pop out” at us.
Below: part of the overlap graph for to_every_thing_turn_turn_turn_there_is_a_season, l = 4, k = 7
[Figure: overlap graph whose nodes are the length-7 substrings (to_ever, o_every, ..., ry_thin, y_thing, _thing_, thing_t, ..., turn_tu, turn_th, ..., there_i, here_is, ...) and whose edges are labeled with overlap lengths 4, 5 or 6]
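A graph like the one in the figure can be built with a few lines of Python. This is a brute-force sketch for the toy example only; the names `overlap_len` and `overlap_graph` are hypothetical, and every ordered pair of distinct 7-mers is compared directly.

```python
def overlap_len(a, b, min_len):
    """Longest proper suffix of a that is a prefix of b, if at least min_len long."""
    for k in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def overlap_graph(reads, min_len):
    """Adjacency dict: graph[a][b] = overlap length of the edge a -> b."""
    reads = sorted(set(reads))          # one node per distinct read
    graph = {r: {} for r in reads}
    for a in reads:
        for b in reads:
            if a != b:
                k = overlap_len(a, b, min_len)
                if k:
                    graph[a][b] = k
    return graph

text = "to_every_thing_turn_turn_turn_there_is_a_season"
kmers = [text[i:i + 7] for i in range(len(text) - 6)]   # k = 7
graph = overlap_graph(kmers, 4)                          # l = 4
```

For example, `graph["ry_thin"]` contains edges to `y_thing`, `_thing_` and `thing_t` with overlap lengths 6, 5 and 4.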
Layout
Anything redundant about this part of the
overlap graph?
Some edges can be inferred (transitively) from
other edges
E.g. green edge can be inferred from blue
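Removing edges that can be inferred from two other edges can be sketched as below. This is a simple illustration of the "skips one node" case, not an optimized transitive-reduction algorithm; `remove_transitive_edges` is a hypothetical name and the graph is an adjacency dict mapping each node to its successors and overlap lengths.

```python
def remove_transitive_edges(graph):
    """Drop every edge u -> w for which some v has edges u -> v and v -> w."""
    to_remove = set()
    for u, nbrs in graph.items():
        for v in nbrs:
            for w in graph.get(v, {}):
                if w in nbrs and w != v:
                    to_remove.add((u, w))   # u -> w skips over v
    return {u: {v: k for v, k in nbrs.items() if (u, v) not in to_remove}
            for u, nbrs in graph.items()}

small = {
    "ry_thin": {"y_thing": 6, "_thing_": 5},
    "y_thing": {"_thing_": 6},
    "_thing_": {},
}
reduced = remove_transitive_edges(small)
```

All removals are decided on the original graph, so the result does not depend on the order in which nodes are visited. Applying the function again to its own output would remove edges that originally skipped two nodes.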
Layout
Remove transitively-inferable edges, starting with edges that skip one node:
Before:
[Figure: the same part of the overlap graph, with edges of overlap length 4, 5 and 6]
Layout
Remove transitively-inferable edges, starting with edges that skip one node:
After:
[Figure: the overlap graph with the edges that skip one node removed; the remaining edges have overlap length 6 or 4]
Layout
Remove transitively-inferable edges, starting with edges that skip one or two nodes:
After:
[Figure: the overlap graph with the edges that skip one or two nodes removed; the remaining edges have overlap length 6, plus a few of length 4]
Even simpler
Layout
Emit contigs corresponding to the non-branching stretches
[Figure: the simplified overlap graph for the whole string, with each non-branching stretch marked as a contig]
Contig 1: to_every_thing_turn_
Contig 2: turn_there_is_a_season
Unresolvable repeat: the turn_ copies between the two contigs
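Emitting contigs from non-branching stretches can be sketched as a walk over maximal non-branching paths. This is an illustrative sketch under assumptions: `contigs_from_graph` is a hypothetical name, the graph is an adjacency dict of overlap lengths, and isolated cycles (which have no non-branching start node) are ignored.

```python
def contigs_from_graph(graph):
    """Spell out one contig per maximal non-branching path.

    graph[a][b] = overlap length of edge a -> b.
    """
    nodes = set(graph) | {v for nbrs in graph.values() for v in nbrs}
    indeg = dict.fromkeys(nodes, 0)
    for nbrs in graph.values():
        for v in nbrs:
            indeg[v] += 1
    outdeg = {n: len(graph.get(n, {})) for n in nodes}

    def one_in_one_out(n):
        return indeg[n] == 1 and outdeg[n] == 1

    contigs = []
    for u in sorted(nodes):
        if one_in_one_out(u):
            continue                      # interior of some path, not a start
        if outdeg[u] == 0:
            if indeg[u] == 0:
                contigs.append(u)         # isolated read is its own contig
            continue
        for v in sorted(graph[u]):
            seq, prev, node = u, u, v
            while True:
                seq += node[graph[prev][node]:]   # append the non-overlapping part
                if not one_in_one_out(node):
                    break
                prev, node = node, next(iter(graph[node]))
            contigs.append(seq)
    return contigs

chain = {"to_ever": {"o_every": 6}, "o_every": {"_every_": 6}}
contigs = contigs_from_graph(chain)
```

On the three-read chain the walk merges the reads into the single contig `to_every_`.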
Layout
In practice, layout step also has to deal with spurious subgraphs, e.g.
because of sequencing error
[Figure: two paths, a and b, diverge at a mismatch that is also a possible repeat boundary; path a continues, path b ends abruptly and is pruned]
Mismatch could be due to sequencing error or repeat. Since the path through b ends abruptly we might conclude it’s an error and prune b.
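Pruning a branch that ends abruptly can be sketched as a dead-end ("tip") removal heuristic. This is one possible heuristic written for illustration, not a method taken from the slides; `prune_tips` and `max_tip_len` are hypothetical, and the graph maps each node to a set of successors.

```python
def prune_tips(graph, max_tip_len=2):
    """Remove short dead-end branches that hang off a branching node."""
    g = {u: set(vs) for u, vs in graph.items()}
    for vs in list(g.values()):
        for v in vs:
            g.setdefault(v, set())        # make sure every node has an entry
    indeg = {u: 0 for u in g}
    for vs in g.values():
        for v in vs:
            indeg[v] += 1
    for u in list(g):
        if u not in g or len(g[u]) < 2:
            continue                      # only branching nodes are examined
        for v in sorted(g[u]):
            path, node = [v], v
            # follow the branch while it is a simple chain, up to max_tip_len nodes
            while indeg[node] == 1 and len(g[node]) == 1 and len(path) < max_tip_len:
                node = next(iter(g[node]))
                path.append(node)
            is_tip = len(g[node]) == 0 and all(indeg[n] == 1 for n in path)
            if is_tip and len(g[u]) > 1:  # never remove the last remaining branch
                g[u].discard(v)
                for n in path:
                    del g[n]
    return g

branching = {"a": {"b", "x"}, "b": {"c"}, "c": {"d"}, "d": {"e"}, "x": set()}
pruned = prune_tips(branching, max_tip_len=2)
```

Here the branch through `x` stops after one node and is pruned, while the longer branch through `b` is kept.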
Overlap Layout Consensus
Overlap: Build overlap graph
Layout: Bundle stretches of the overlap graph into contigs
Consensus: Pick most likely nucleotide sequence for each contig
Consensus
Take reads that make up a contig and line them up (- marks a gap)
TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
Take consensus, i.e. majority vote
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
At each position, ask: what nucleotide (and/or gap) is here?
Complications: (a) sequencing error, (b) ploidy
Say the true genotype is AG, but we have a high sequencing error rate and only about 6 reads covering the position.
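Majority voting over aligned reads can be sketched per column. This is a minimal illustration: the function name `consensus` is hypothetical, rows are assumed to be already aligned and padded to equal length with `-` for gaps, and ties go to whichever symbol appears first in the column.

```python
from collections import Counter

def consensus(rows):
    """Column-wise majority vote over equal-length aligned rows ('-' = gap)."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("rows must be aligned to the same length")
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))

rows = [
    "TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
    "TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA",
]
result = consensus(rows)
```

A plain majority vote like this ignores ploidy: at a heterozygous position it would report only the more frequently observed allele.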
Overlap Layout Consensus
Overlap: Build overlap graph
Layout: Bundle stretches of the overlap graph into contigs
Consensus: Pick most likely nucleotide sequence for each contig
OLC drawbacks
Building overlap graph is slow. We saw O(N + a) and O(N^2) approaches.
2nd-generation sequencing datasets are ~100s of millions or billions of reads, hundreds of billions of nucleotides total
Overlap graph is big; one node per read, and in practice # edges grows superlinearly with # reads
Assembly alternatives
Alternative 1: Overlap-Layout-Consensus (OLC) assembly
Stages: Overlap, Layout, Consensus
Alternative 2: de Bruijn graph (DBG) assembly
Stages: Error correction, de Bruijn graph, Scaffolding, Refine

More Related Content

What's hot

Hidden markov model
Hidden markov modelHidden markov model
Hidden markov modelUshaYadav24
 
Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-naveed ul mushtaq
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analysesrjorton
 
The Smith Waterman algorithm
The Smith Waterman algorithmThe Smith Waterman algorithm
The Smith Waterman algorithmavrilcoghlan
 
Next Generation Sequencing
Next Generation SequencingNext Generation Sequencing
Next Generation SequencingArindam Ghosh
 
Sequence Submission Tools
Sequence Submission ToolsSequence Submission Tools
Sequence Submission ToolsRishikaMaji
 
The Needleman Wunsch algorithm
The Needleman Wunsch algorithmThe Needleman Wunsch algorithm
The Needleman Wunsch algorithmavrilcoghlan
 
Tech Talk: UCSC Genome Browser
Tech Talk: UCSC Genome BrowserTech Talk: UCSC Genome Browser
Tech Talk: UCSC Genome BrowserHoffman Lab
 
second generation of DNA Sequencing
second generation of DNA Sequencingsecond generation of DNA Sequencing
second generation of DNA SequencingSidra Shaffique
 
Integrative omics approches
Integrative omics approches   Integrative omics approches
Integrative omics approches Sayali Magar
 
DNA microarray final ppt.
DNA microarray final ppt.DNA microarray final ppt.
DNA microarray final ppt.Aashish Patel
 

What's hot (20)

Hidden markov model
Hidden markov modelHidden markov model
Hidden markov model
 
Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
The Cre-LoxP System
The Cre-LoxP SystemThe Cre-LoxP System
The Cre-LoxP System
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analyses
 
Clustal X
Clustal XClustal X
Clustal X
 
Genome annotation
Genome annotationGenome annotation
Genome annotation
 
The Smith Waterman algorithm
The Smith Waterman algorithmThe Smith Waterman algorithm
The Smith Waterman algorithm
 
Next Generation Sequencing
Next Generation SequencingNext Generation Sequencing
Next Generation Sequencing
 
Sequence Submission Tools
Sequence Submission ToolsSequence Submission Tools
Sequence Submission Tools
 
The Needleman Wunsch algorithm
The Needleman Wunsch algorithmThe Needleman Wunsch algorithm
The Needleman Wunsch algorithm
 
SNP
SNPSNP
SNP
 
Pcr primer design english version
Pcr primer design english versionPcr primer design english version
Pcr primer design english version
 
Tech Talk: UCSC Genome Browser
Tech Talk: UCSC Genome BrowserTech Talk: UCSC Genome Browser
Tech Talk: UCSC Genome Browser
 
second generation of DNA Sequencing
second generation of DNA Sequencingsecond generation of DNA Sequencing
second generation of DNA Sequencing
 
Integrative omics approches
Integrative omics approches   Integrative omics approches
Integrative omics approches
 
NGS File formats
NGS File formatsNGS File formats
NGS File formats
 
Protein fold recognition and ab_initio modeling
Protein fold recognition and ab_initio modelingProtein fold recognition and ab_initio modeling
Protein fold recognition and ab_initio modeling
 
DNA Sequencing
DNA Sequencing DNA Sequencing
DNA Sequencing
 
DNA microarray final ppt.
DNA microarray final ppt.DNA microarray final ppt.
DNA microarray final ppt.
 

Similar to OLC assembly overlap layout consensus

My presentation at University of Nottingham "Fast low-rank methods for solvin...
My presentation at University of Nottingham "Fast low-rank methods for solvin...My presentation at University of Nottingham "Fast low-rank methods for solvin...
My presentation at University of Nottingham "Fast low-rank methods for solvin...Alexander Litvinenko
 
Introduction to PyTorch
Introduction to PyTorchIntroduction to PyTorch
Introduction to PyTorchJun Young Park
 
High Performance Systems Without Tears - Scala Days Berlin 2018
High Performance Systems Without Tears - Scala Days Berlin 2018High Performance Systems Without Tears - Scala Days Berlin 2018
High Performance Systems Without Tears - Scala Days Berlin 2018Zahari Dichev
 
Data sparse approximation of the Karhunen-Loeve expansion
Data sparse approximation of the Karhunen-Loeve expansionData sparse approximation of the Karhunen-Loeve expansion
Data sparse approximation of the Karhunen-Loeve expansionAlexander Litvinenko
 
Data sparse approximation of Karhunen-Loeve Expansion
Data sparse approximation of Karhunen-Loeve ExpansionData sparse approximation of Karhunen-Loeve Expansion
Data sparse approximation of Karhunen-Loeve ExpansionAlexander Litvinenko
 
Graphical Model Selection for Big Data
Graphical Model Selection for Big DataGraphical Model Selection for Big Data
Graphical Model Selection for Big DataAlexander Jung
 
Integration
IntegrationIntegration
Integrationlecturer
 
A common unique random fixed point theorem in hilbert space using integral ty...
A common unique random fixed point theorem in hilbert space using integral ty...A common unique random fixed point theorem in hilbert space using integral ty...
A common unique random fixed point theorem in hilbert space using integral ty...Alexander Decker
 
Application of parallel hierarchical matrices and low-rank tensors in spatial...
Application of parallel hierarchical matrices and low-rank tensors in spatial...Application of parallel hierarchical matrices and low-rank tensors in spatial...
Application of parallel hierarchical matrices and low-rank tensors in spatial...Alexander Litvinenko
 
G-TAD: Sub-Graph Localization for Temporal Action Detection
G-TAD: Sub-Graph Localization for Temporal Action DetectionG-TAD: Sub-Graph Localization for Temporal Action Detection
G-TAD: Sub-Graph Localization for Temporal Action DetectionMengmeng Xu
 
Practical and Worst-Case Efficient Apportionment
Practical and Worst-Case Efficient ApportionmentPractical and Worst-Case Efficient Apportionment
Practical and Worst-Case Efficient ApportionmentRaphael Reitzig
 
reservoir-modeling-using-matlab-the-matalb-reservoir-simulation-toolbox-mrst.pdf
reservoir-modeling-using-matlab-the-matalb-reservoir-simulation-toolbox-mrst.pdfreservoir-modeling-using-matlab-the-matalb-reservoir-simulation-toolbox-mrst.pdf
reservoir-modeling-using-matlab-the-matalb-reservoir-simulation-toolbox-mrst.pdfRTEFGDFGJU
 
Dynamic Parameterized Problems - Algorithms and Complexity
Dynamic Parameterized Problems - Algorithms and ComplexityDynamic Parameterized Problems - Algorithms and Complexity
Dynamic Parameterized Problems - Algorithms and Complexitycseiitgn
 
[系列活動] Data exploration with modern R
[系列活動] Data exploration with modern R[系列活動] Data exploration with modern R
[系列活動] Data exploration with modern R台灣資料科學年會
 
MVPA with SpaceNet: sparse structured priors
MVPA with SpaceNet: sparse structured priorsMVPA with SpaceNet: sparse structured priors
MVPA with SpaceNet: sparse structured priorsElvis DOHMATOB
 
GAN in_kakao
GAN in_kakaoGAN in_kakao
GAN in_kakaoJunho Kim
 
VJAI Paper Reading#3-KDD2019-ClusterGCN
VJAI Paper Reading#3-KDD2019-ClusterGCNVJAI Paper Reading#3-KDD2019-ClusterGCN
VJAI Paper Reading#3-KDD2019-ClusterGCNDat Nguyen
 
Container and algorithms(C++)
Container and algorithms(C++)Container and algorithms(C++)
Container and algorithms(C++)JerryHe
 

Similar to OLC assembly overlap layout consensus (20)

My presentation at University of Nottingham "Fast low-rank methods for solvin...
My presentation at University of Nottingham "Fast low-rank methods for solvin...My presentation at University of Nottingham "Fast low-rank methods for solvin...
My presentation at University of Nottingham "Fast low-rank methods for solvin...
 
Introduction to PyTorch
Introduction to PyTorchIntroduction to PyTorch
Introduction to PyTorch
 
High Performance Systems Without Tears - Scala Days Berlin 2018
High Performance Systems Without Tears - Scala Days Berlin 2018High Performance Systems Without Tears - Scala Days Berlin 2018
High Performance Systems Without Tears - Scala Days Berlin 2018
 
Data sparse approximation of the Karhunen-Loeve expansion
Data sparse approximation of the Karhunen-Loeve expansionData sparse approximation of the Karhunen-Loeve expansion
Data sparse approximation of the Karhunen-Loeve expansion
 
Slides
SlidesSlides
Slides
 
Data sparse approximation of Karhunen-Loeve Expansion
Data sparse approximation of Karhunen-Loeve ExpansionData sparse approximation of Karhunen-Loeve Expansion
Data sparse approximation of Karhunen-Loeve Expansion
 
Graphical Model Selection for Big Data
Graphical Model Selection for Big DataGraphical Model Selection for Big Data
Graphical Model Selection for Big Data
 
Integration
IntegrationIntegration
Integration
 
A common unique random fixed point theorem in hilbert space using integral ty...
A common unique random fixed point theorem in hilbert space using integral ty...A common unique random fixed point theorem in hilbert space using integral ty...
A common unique random fixed point theorem in hilbert space using integral ty...
 
Application of parallel hierarchical matrices and low-rank tensors in spatial...
Application of parallel hierarchical matrices and low-rank tensors in spatial...Application of parallel hierarchical matrices and low-rank tensors in spatial...
Application of parallel hierarchical matrices and low-rank tensors in spatial...
 
G-TAD: Sub-Graph Localization for Temporal Action Detection
G-TAD: Sub-Graph Localization for Temporal Action DetectionG-TAD: Sub-Graph Localization for Temporal Action Detection
G-TAD: Sub-Graph Localization for Temporal Action Detection
 
Practical and Worst-Case Efficient Apportionment
Practical and Worst-Case Efficient ApportionmentPractical and Worst-Case Efficient Apportionment
Practical and Worst-Case Efficient Apportionment
 
2_1 Edit Distance.pptx
2_1 Edit Distance.pptx2_1 Edit Distance.pptx
2_1 Edit Distance.pptx
 
reservoir-modeling-using-matlab-the-matalb-reservoir-simulation-toolbox-mrst.pdf
reservoir-modeling-using-matlab-the-matalb-reservoir-simulation-toolbox-mrst.pdfreservoir-modeling-using-matlab-the-matalb-reservoir-simulation-toolbox-mrst.pdf
reservoir-modeling-using-matlab-the-matalb-reservoir-simulation-toolbox-mrst.pdf
 
Dynamic Parameterized Problems - Algorithms and Complexity
Dynamic Parameterized Problems - Algorithms and ComplexityDynamic Parameterized Problems - Algorithms and Complexity
Dynamic Parameterized Problems - Algorithms and Complexity
 
[系列活動] Data exploration with modern R
[系列活動] Data exploration with modern R[系列活動] Data exploration with modern R
[系列活動] Data exploration with modern R
 
MVPA with SpaceNet: sparse structured priors
MVPA with SpaceNet: sparse structured priorsMVPA with SpaceNet: sparse structured priors
MVPA with SpaceNet: sparse structured priors
 
GAN in_kakao
GAN in_kakaoGAN in_kakao
GAN in_kakao
 
VJAI Paper Reading#3-KDD2019-ClusterGCN
VJAI Paper Reading#3-KDD2019-ClusterGCNVJAI Paper Reading#3-KDD2019-ClusterGCN
VJAI Paper Reading#3-KDD2019-ClusterGCN
 
Container and algorithms(C++)
Container and algorithms(C++)Container and algorithms(C++)
Container and algorithms(C++)
 

Recently uploaded

OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024innovationoecd
 
Evidences of Evolution General Biology 2
Evidences of Evolution General Biology 2Evidences of Evolution General Biology 2
Evidences of Evolution General Biology 2John Carlo Rollon
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxNandakishor Bhaurao Deshmukh
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentationtahreemzahra82
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxEran Akiva Sinbar
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024AyushiRastogi48
 
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptxGenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptxBerniceCayabyab1
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 
TOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsTOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsssuserddc89b
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Nistarini College, Purulia (W.B) India
 
Transposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.pptTransposable elements in prokaryotes.ppt
Transposable elements in prokaryotes.pptArshadWarsi13
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxpriyankatabhane
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfSwapnil Therkar
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
Twin's paradox experiment is a meassurement of the extra dimensions.pptx
Twin's paradox experiment is a meassurement of the extra dimensions.pptxTwin's paradox experiment is a meassurement of the extra dimensions.pptx
Twin's paradox experiment is a meassurement of the extra dimensions.pptxEran Akiva Sinbar
 
Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫qfactory1
 
BREEDING FOR RESISTANCE TO BIOTIC STRESS.pptx
BREEDING FOR RESISTANCE TO BIOTIC STRESS.pptxBREEDING FOR RESISTANCE TO BIOTIC STRESS.pptx
BREEDING FOR RESISTANCE TO BIOTIC STRESS.pptxPABOLU TEJASREE
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Patrick Diehl
 
Recombinant DNA technology( Transgenic plant and animal)
Recombinant DNA technology( Transgenic plant and animal)Recombinant DNA technology( Transgenic plant and animal)
Recombinant DNA technology( Transgenic plant and animal)DHURKADEVIBASKAR
 
‏‏VIRUS - 123455555555555555555555555555555555555555
‏‏VIRUS -  123455555555555555555555555555555555555555‏‏VIRUS -  123455555555555555555555555555555555555555
‏‏VIRUS - 123455555555555555555555555555555555555555kikilily0909
 

Recently uploaded (20)

OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024
 
Evidences of Evolution General Biology 2
Evidences of Evolution General Biology 2Evidences of Evolution General Biology 2
Evidences of Evolution General Biology 2
 
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptxTHE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
THE ROLE OF PHARMACOGNOSY IN TRADITIONAL AND MODERN SYSTEM OF MEDICINE.pptx
 
Harmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms PresentationHarmful and Useful Microorganisms Presentation
Harmful and Useful Microorganisms Presentation
 
The dark energy paradox leads to a new structure of spacetime.pptx

OLC assembly overlap layout consensus

  • 1. Overlap Layout Consensus assembly Ben Langmead You are free to use these slides. If you do, please sign the guestbook (www.langmead-lab.org/teaching-materials), or email me (ben.langmead@gmail.com) and tell me briefly how you’re using them. For original Keynote files, email me.
  • 2. Real-world assembly methods. OLC: Overlap-Layout-Consensus assembly. DBG: De Bruijn graph assembly. Both handle unresolvable repeats by essentially leaving them out. Unresolvable repeats break the assembly into fragments. Fragments are contigs (short for contiguous). Example: assembling substrings of a_long_long_long_time with Greedy-SCS gives a_long_long_time; assembling them with OLC or DBG gives the contigs a_long and long_time.
  • 3. Assembly alternatives. Alternative 1: Overlap-Layout-Consensus (OLC) assembly, with stages Overlap, Layout, Consensus. Alternative 2: de Bruijn graph (DBG) assembly, with stages Error correction, de Bruijn graph, Scaffolding, Refine.
  • 4. Overlap Layout Consensus. Overlap: build overlap graph. Layout: bundle stretches of the overlap graph into contigs. Consensus: pick most likely nucleotide sequence for each contig.
  • 5. Finding overlaps. Can we be less naive than this? Example: X = CTCTAGGCC, Y = TAGGCCCTC. Say l = 3. Take the length-3 suffix of X (GCC) and look for it in Y, going right-to-left. Found it (at offset 3 of Y). Extend to the left; in this case, we confirm that a length-6 prefix of Y (TAGGCC) matches a suffix of X. We’re doing this for every pair of input strings.
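The search described on slide 5 can be sketched in Python roughly as follows. This is a minimal illustration, not code from the slides: the function names `suffix_prefix_match` and `all_overlaps` and the parameter `min_len` (the slide's l) are assumptions made for this example.

```python
def suffix_prefix_match(x, y, min_len):
    """Return the length of the longest suffix of x, at least min_len long,
    that equals a prefix of y; return 0 if there is none."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    if len(x) < min_len or len(y) < min_len:
        return 0
    seed = x[-min_len:]              # length-l suffix of x
    end = len(y)                     # current search window is y[:end]
    while True:
        hit = y.rfind(seed, 0, end)  # rightmost occurrence inside the window
        if hit == -1:
            return 0
        overlap = hit + min_len      # candidate overlap: the prefix y[:overlap]
        if overlap <= len(x) and x[-overlap:] == y[:overlap]:
            return overlap           # extension to the left succeeded
        end = overlap - 1            # force the next hit to lie further left


def all_overlaps(reads, min_len):
    """Try every ordered pair of reads (the quadratic part the slide warns about)."""
    overlaps = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                length = suffix_prefix_match(a, b, min_len)
                if length > 0:
                    overlaps[(a, b)] = length
    return overlaps
```

Because the seed is searched from the right end of Y, the first candidate that extends successfully is also the longest overlap. With the slide's strings, `suffix_prefix_match("CTCTAGGCC", "TAGGCCCTC", 3)` confirms the length-6 overlap.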
  • 6. Finding overlaps. Can we use suffix trees for overlapping? Problem: given a collection of strings S, for each string x in S find all overlaps involving a prefix of x and a suffix of another string y. Hint: build a generalized suffix tree of the strings in S.
  • 7. Finding overlaps with suffix tree. [Figure: generalized suffix tree for {“GACATA”, “ATAGAC”}, built over GACATA$0ATAGAC$1.] Say query = GACATA. From root, follow path labeled with query. Green edge implies length-3 suffix of second string equals length-3 prefix of query: the suffix GAC of ATAGAC matches the prefix GAC of GACATA.
  • 8. Finding overlaps with suffix tree. [Figure: the same generalized suffix tree.] Strategy: (1) Build tree. (2) For each string: walk down from root and report any outgoing edge labeled with a separator. Each corresponds to a prefix/suffix match involving prefix of query string and suffix of string ending in the separator.
  • 9. Finding overlaps with suffix tree. [Figure: the same generalized suffix tree.] Now let query be second string: ATAGAC. Overlaps shown: the length-3 suffix ATA of GACATA matches the length-3 prefix of ATAGAC; the length-1 suffix A of GACATA matches the length-1 prefix of ATAGAC; and, from the earlier query, the length-3 suffix GAC of ATAGAC matches the length-3 prefix of GACATA.
  • 10. Finding overlaps with suffix tree. [Figure: the same generalized suffix tree, with query paths in red and separator edges in green.] Say there are d reads of length n, total length N = dn, and a = # read pairs that overlap. Time to build generalized suffix tree: O(N). Time to walk down red paths: O(N). Time to find & report overlaps (green): O(a). Overall: O(N + a). d^2 doesn’t appear explicitly, but a is O(d^2) in worst case. Assume for given string pair we report only the longest suffix/prefix match.
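The walk-down strategy from slides 7 to 10 can be illustrated with a small Python sketch. Note the substitution: to stay short it builds an uncompressed suffix trie by inserting every suffix, which takes quadratic time and space per read, rather than the linear-time generalized suffix tree the slides assume. The names `build_suffix_trie`, `overlaps_via_trie` and the `END` marker are hypothetical, and reads are assumed not to contain the "$" character.

```python
END = "$"  # separator marker; assumed absent from the reads themselves


def build_suffix_trie(reads):
    """Insert every suffix of every read; mark where each suffix ends and
    which read it came from (the role of the $0, $1 separators)."""
    root = {}
    for read_id, read in enumerate(reads):
        for start in range(len(read)):
            node = root
            for ch in read[start:]:
                node = node.setdefault(ch, {})
            node.setdefault(END, set()).add(read_id)
    return root


def overlaps_via_trie(reads):
    """For each query read, walk down from the root along the query and
    report separator marks met on the way. Returns a dict mapping
    (suffix_read_id, prefix_read_id) to the longest overlap length."""
    root = build_suffix_trie(reads)
    longest = {}
    for query_id, query in enumerate(reads):
        node = root
        for depth, ch in enumerate(query, start=1):
            node = node[ch]  # always present: the query is one of its own suffixes
            for other_id in node.get(END, ()):
                if other_id != query_id:
                    # deeper marks overwrite shallower ones, keeping the longest
                    longest[(other_id, query_id)] = depth
    return longest
```

For the slides' example, `overlaps_via_trie(["GACATA", "ATAGAC"])` reports a length-3 overlap in each direction; the length-1 overlap (A) is found first and then replaced by the longer ATA match, in line with the "report only the longest" assumption on slide 10.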
  • 11. Finding overlaps. What if we want to allow mismatches and gaps in the overlap? Example: X = CTCGGCCCTAGG, Y = GGCTCTAGGCCC; the suffix GGCCCTAGG of X lines up with the prefix GGCTCTAGG of Y with one mismatch. I.e. how do we find the best alignment of a suffix of X to a prefix of Y? Dynamic programming. But we must frame the problem such that only backtraces involving a suffix of X and a prefix of Y are allowed.
  • 12. Finding overlaps with dynamic programming. X = CTCGGCCCTAGG, Y = GGCTCTAGGCCC. Find the best alignment of a suffix of X to a prefix of Y. We’ll use global alignment recurrence and score function. Score function s(a, b):
            A   C   G   T   -
        A   0   4   2   4   8
        C   4   0   4   2   8
        G   2   4   0   4   8
        T   4   2   4   0   8
        -   8   8   8   8
     Recurrence: D[i, j] = min { D[i-1, j] + s(x[i-1], -),  D[i, j-1] + s(-, y[j-1]),  D[i-1, j-1] + s(x[i-1], y[j-1]) }. But how do we force it to find prefix / suffix matches?
  • 13. Finding overlaps with dynamic programming. Find the best alignment of a suffix of X to a prefix of Y. Score function s(a, b) and recurrence for D[i, j] as on slide 12. How to initialize first row & column so suffix of X aligns to prefix of Y? First column gets 0s (any suffix of X is possible). First row gets ∞s (must be a prefix of Y). Backtrace from last row. Matrix (rows: X, columns: Y):
            -   G   G   C   T   C   T   A   G   G   C   C   C
        -   0   ∞   ∞   ∞   ∞   ∞   ∞   ∞   ∞   ∞   ∞   ∞   ∞
        C   0   4  12  20  28  36  44  52  60  68  76  84  92
        T   0   4   8  14  20  28  36  44  52  60  68  76  84
        C   0   4   8   8  16  20  28  36  44  52  60  68  76
        G   0   0   4  12  12  20  24  30  36  44  52  60  68
        G   0   0   0   8  16  16  24  26  30  36  44  52  60
        C   0   4   4   0   8  16  18  26  30  34  36  44  52
        C   0   4   8   4   2   8  16  22  30  34  34  36  44
        C   0   4   8   8   6   2  10  18  26  34  34  34  36
        T   0   4   8  10   8   8   2  10  18  26  34  36  36
        A   0   2   6  12  14  12  10   2  10  18  26  34  40
        G   0   0   2  10  16  18  16  10   0  10  18  26  34
        G   0   0   0   6  14  20  22  18  10   2  10  18  26
  • 14. Finding overlaps with dynamic programming. Find the best alignment of a suffix of X to a prefix of Y. Score function s(a, b) and recurrence for D[i, j] as on slide 12. [Figure: the same matrix as on slide 13.] Problem: very short matches got high scores by chance (the 0s at the left end of the last row), which might obscure the more relevant match. Say we want to enforce minimum overlap length l = 5.
  • 15. Finding overlaps with dynamic programming. Find the best alignment of a suffix of X to a prefix of Y. Score function s(a, b) and recurrence for D[i, j] as on slide 12. Solve by initializing certain additional cells to ∞. Cells whose values changed are highlighted in red on the slide. Now the relevant match is the best candidate. Matrix (rows: X, columns: Y):
            -   G   G   C   T   C   T   A   G   G   C   C   C
        -   0   ∞   ∞   ∞   ∞   ∞   ∞   ∞   ∞   ∞   ∞   ∞   ∞
        C   0   4  12  20  28  36  44  52  60  68  76  84  92
        T   0   4   8  14  20  28  36  44  52  60  68  76  84
        C   0   4   8   8  16  20  28  36  44  52  60  68  76
        G   0   0   4  12  12  20  24  30  36  44  52  60  68
        G   0   0   0   8  16  16  24  26  30  36  44  52  60
        C   0   4   4   0   8  16  18  26  30  34  36  44  52
        C   0   4   8   4   2   8  16  22  30  34  34  36  44
        C   0   4   8   8   6   2  10  18  26  34  34  34  36
        T   ∞   4   8  10   8   8   2  10  18  26  34  36  36
        A   ∞  12   6  12  14  12  10   2  10  18  26  34  40
        G   ∞  20  12  10  16  18  16  10   0  10  18  26  34
        G   ∞   ∞   ∞   ∞   ∞  20  22  18  10   2  10  18  26
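The dynamic programming formulation of slides 12 to 15 might look like this in Python. It is a sketch under stated assumptions: `score` encodes the slides' s(a, b) table, while the function name `best_overlap`, the `min_len` parameter, and the exact set of cells blocked with ∞ are choices made for this example (here, any start in the first column that would leave a suffix of X shorter than `min_len`, and any end in the last row before column `min_len`), which may differ slightly from the cells blocked on slide 15.

```python
INF = float("inf")


def score(a, b):
    """s(a, b) from the slides: match 0, A<->G or C<->T 2, other mismatch 4, gap 8."""
    if a == b:
        return 0
    if a == "-" or b == "-":
        return 8
    if {a, b} in ({"A", "G"}, {"C", "T"}):
        return 2
    return 4


def best_overlap(x, y, min_len=1):
    """Best (lowest-cost) alignment of a suffix of x to a prefix of y.
    Returns (cost, length of the y prefix used)."""
    n, m = len(x), len(y)
    # First row stays INF for j >= 1: the alignment must start at the start of y.
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    # First column: 0 wherever the remaining suffix of x is long enough.
    for i in range(n + 1):
        if n - i >= min_len:
            D[i][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + score(x[i - 1], "-"),
                D[i][j - 1] + score("-", y[j - 1]),
                D[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
            )
    # The alignment must end at the end of x, so only the last row is read.
    candidates = [(D[n][j], j) for j in range(min_len, m + 1)]
    return min(candidates) if candidates else (INF, 0)
```

With the slides' strings and `min_len=5`, the best candidate is the cost-2 alignment that uses the first 9 characters of Y. With `min_len=1`, a cost-0 match of trivial length wins instead, which is the problem slide 14 points out.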
  • 16. Finding overlaps with dynamic programming. Say there are d reads of length n, total length N = dn, and a is total number of pairs with an overlap. Number of overlaps to try: O(d^2). Size of each dynamic programming matrix: O(n^2). Overall: O(d^2 n^2) = O((dn)^2) = O(N^2). Contrast O(N^2) with suffix tree: O(N + a), but where a is worst-case O(d^2). But dynamic programming is more flexible, allowing mismatches and gaps. Real-world overlappers mix the two, using indexes to filter out vast majority of non-overlapping pairs, then using dynamic programming for remaining pairs.
  • 17. Finding overlaps. Overlapping is typically the slowest part of assembly. Consider a second-generation sequencing dataset with hundreds of millions or billions of reads! Approaches from alignment unit can be adapted to finding overlaps. We saw adaptations of naive exact matching, suffix-tree-assisted exact matching, and dynamic programming. Could also have adapted efficient exact matching, approximate string matching, co-traversal, ...
  • 18. Finding overlaps. Celera Assembler’s overlapper is probably the best documented: http://wgs-assembler.sourceforge.net/wiki/index.php/RunCA#Overlapper. Inverted substring indexes built on batches of reads. Only look for overlaps between reads that share one or more substrings of some length.
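The filtering idea on slide 18 (only compare reads that share a substring of some length) can be sketched with a simple inverted index. This is an illustration of the general idea only, not Celera Assembler's actual implementation; the names `build_kmer_index` and `candidate_pairs` and the parameter `k` are assumptions for this example.

```python
from collections import defaultdict
from itertools import combinations


def build_kmer_index(reads, k):
    """Inverted index: each length-k substring maps to the ids of reads containing it."""
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for start in range(len(read) - k + 1):
            index[read[start:start + k]].add(read_id)
    return index


def candidate_pairs(reads, k):
    """Pairs of read ids sharing at least one k-mer. Only these pairs would be
    passed on to a slower, more exact overlap check."""
    pairs = set()
    for read_ids in build_kmer_index(reads, k).values():
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs
```

A pair that survives the filter is only a candidate: sharing a k-mer does not by itself mean a suffix/prefix overlap exists, so a check such as dynamic programming would still follow for the remaining pairs.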
  • 19. Overlap Layout Consensus. Overlap: build overlap graph. Layout: bundle stretches of the overlap graph into contigs. Consensus: pick most likely nucleotide sequence for each contig.
  • 20. Layout. Overlap graph is big and messy. Contigs don’t “pop out” at us. Below: part of the overlap graph for to_every_thing_turn_turn_turn_there_is_a_season, l = 4, k = 7. [Figure: overlap graph whose nodes are 7-mers (ry_thin, y_thing, _thing_, thing_t, hing_tu, ing_tur, ng_turn, g_turn_, _turn_t, turn_tu, urn_tur, rn_turn, n_turn_, turn_th, urn_the, rn_ther, n_there, _there_, there_i, here_is, ere_is_, re_is_a, e_is_a_) and whose edges are labeled with overlap lengths 4, 5 or 6.]
  • 21. Layout. Anything redundant about this part of the overlap graph? Some edges can be inferred (transitively) from other edges. E.g. green edge can be inferred from blue. [Figure: part of the overlap graph with nodes to_ever, o_every, _every_, every_t, very_th, ery_thi, ry_thin, y_thing, _thing_ and edges labeled 4, 5 or 6.]
  • 22. Layout. Remove transitively-inferrible edges, starting with edges that skip one node. [Figure, labeled “Before”: the overlap graph from slide 20, prior to edge removal.]
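The edge removal described on slides 21 and 22 can be sketched as follows. This is a simplified illustration that assumes distinct reads of equal length, so that an edge a -> c counts as inferrible from a -> b and b -> c exactly when the overlap lengths are consistent; the function names `overlap_graph` and `remove_transitive_edges` are hypothetical.

```python
def overlap_graph(reads, min_len):
    """Edge a -> b labeled with the longest suffix/prefix overlap of length
    at least min_len (full-length overlaps are not considered)."""
    graph = {read: {} for read in reads}
    for a in reads:
        for b in reads:
            if a == b:
                continue
            for length in range(min(len(a), len(b)) - 1, min_len - 1, -1):
                if a.endswith(b[:length]):
                    graph[a][b] = length
                    break
    return graph


def remove_transitive_edges(graph):
    """Drop a -> c when some b has a -> b and b -> c with consistent lengths.
    Assumes equal-length reads: overlap(a, c) = overlap(a, b) + overlap(b, c) - len(b)."""
    reduced = {a: dict(neighbors) for a, neighbors in graph.items()}
    for a, neighbors in graph.items():
        for b, ab in neighbors.items():
            for c, bc in graph[b].items():
                if c in neighbors and neighbors[c] == ab + bc - len(b):
                    reduced[a].pop(c, None)
    return reduced
```

With 7-mers and l = 4, the length-5 edges (which skip one node) and the length-4 edges (which skip two) are both removed in a single pass here, because inferrability is always tested against the original graph rather than the partially reduced one.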
  • 25. Layout. Emit contigs corresponding to the non-branching stretches. [Figure: the reduced overlap graph; remaining edges are labeled mostly 6, with a few labeled 4.] Contig 1: to_every_thing_turn_. Contig 2: turn_there_is_a_season. The stretch between them is an unresolvable repeat.
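Emitting contigs from non-branching stretches could be sketched like this. It assumes the graph is a dict of dicts mapping each read to its successors and overlap lengths, with every read present as a key; the name `contigs_from_graph` is hypothetical. An edge is followed only when it is the sole outgoing edge of its source and the sole incoming edge of its target, which is one simple way to define "non-branching" and not necessarily the exact rule behind the slide's figure.

```python
def contigs_from_graph(graph):
    """Merge reads along unambiguous edges and return the resulting contigs."""
    indegree = {node: 0 for node in graph}
    for neighbors in graph.values():
        for nxt in neighbors:
            indegree[nxt] += 1

    # Keep edge u -> v only if it is u's only way out and v's only way in.
    successor = {}
    for node, neighbors in graph.items():
        if len(neighbors) == 1:
            (nxt, length), = neighbors.items()
            if indegree[nxt] == 1:
                successor[node] = (nxt, length)

    has_predecessor = {nxt for nxt, _ in successor.values()}
    contigs = []
    for node in graph:
        if node in has_predecessor:
            continue  # this read is picked up by the chain that leads into it
        contig = node
        while node in successor:
            node, length = successor[node]
            contig += node[length:]  # append only the non-overlapping tail
        contigs.append(contig)
    # Note: a cycle made purely of unambiguous edges has no start node
    # and is silently skipped by this sketch.
    return contigs
```

Reads at a branch point end the current contig, so a repeat that creates branching splits the assembly into several contigs instead of being resolved.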
  • 26. Layout. In practice, layout step also has to deal with spurious subgraphs, e.g. because of sequencing error. [Figure: two paths, a and b, diverge at a mismatch that marks a possible repeat boundary; b is then pruned.] Mismatch could be due to sequencing error or repeat. Since the path through b ends abruptly we might conclude it’s an error and prune b.
  • 27. Overlap Layout Consensus. Overlap: build overlap graph. Layout: bundle stretches of the overlap graph into contigs. Consensus: pick most likely nucleotide sequence for each contig.
  • 28. Consensus. Take reads that make up a contig and line them up (- marks a position where the read has no base):
        TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA
        TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
        TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA
        TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
        TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
        TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
     At each position, ask: what nucleotide (and/or gap) is here? Take consensus, i.e. majority vote. Complications: (a) sequencing error, (b) ploidy. Say the true genotype is AG, but we have a high sequencing error rate and only about 6 reads covering the position.
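The majority vote on slide 28 can be sketched in a few lines of Python. This assumes the reads have already been lined up into rows of equal length with "-" standing for a gap; the function name `consensus` and the choice to drop columns where the gap wins are assumptions of this example.

```python
from collections import Counter


def consensus(rows, gap="-"):
    """Column-wise majority vote over aligned rows of equal length.
    Columns where the gap symbol wins are left out of the result."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("rows must be non-empty and of equal length")
    result = []
    for column in zip(*rows):
        winner, _ = Counter(column).most_common(1)[0]
        if winner != gap:
            result.append(winner)
    return "".join(result)
```

A plain majority vote ignores the complications the slide lists: with few reads and a high error rate it can pick the wrong base, and at a heterozygous site (true genotype AG) it reports only one of the two alleles.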
  • 29. Overlap Layout Consensus. Overlap: build overlap graph. Layout: bundle stretches of the overlap graph into contigs. Consensus: pick most likely nucleotide sequence for each contig. OLC drawbacks: Building overlap graph is slow. We saw O(N + a) and O(N^2) approaches. 2nd-generation sequencing datasets are ~ 100s of millions or billions of reads, hundreds of billions of nucleotides total. Overlap graph is big; one node per read, and in practice # edges grows superlinearly with # reads.
  • 30. Assembly alternatives. Alternative 1: Overlap-Layout-Consensus (OLC) assembly, with stages Overlap, Layout, Consensus. Alternative 2: de Bruijn graph (DBG) assembly, with stages Error correction, de Bruijn graph, Scaffolding, Refine.