de Bruijn Graph Construction from Combination of Short and Long Reads

CSE 6406 : Bioinformatics Algorithms
Course Faculty: Dr. Atif Hasan Rahman

Group Members
KAZI LUTFUL KABIR (1015052067)
SIKDER TAHSIN AL-AMIN (1015052076)
MD MAHABUR RAHMAN (1015052016)

Outline
 Common Terminology
 Motivation
 de Bruijn Graph
 A-Bruijn Graph
 Finding Genomic Path
 Error Correction in Draft Genome
 Potential Scopes of Development

Common Terminology
 Read: the sequence obtained from one cluster at the end of the sequencing process;
it is the sequence of a section of a unique fragment
 Contig: a set of reads related to each other by overlaps in their sequences
 Genomic Path: a path in the assembly graph that corresponds to traversing the genome
 Draft genome: a sequence of genomic DNA with lower accuracy than a finished sequence
- some segments are missing or in the wrong order or orientation
 Tip: a dead-end branch caused by a sequencing error; the path ends prematurely and
contains both correct and incorrect k-mers
 Bubble: an alternative path caused by a sequencing error; its k-mers diverge from the
main graph and then reconnect with it

Limitations of Classical de Bruijn Graph
 Imperfect coverage of genome by reads (the classical approach assumes that
every k-mer from the genome is represented by a read)
 Reads are error-prone
 Multiplicities of k-mers are unknown
 Distances between reads within the read-pairs are inexact

Motivation
 Implicit assumption: the de Bruijn approach is inapplicable to long-read assembly
 Misunderstanding: the de Bruijn graph can only assemble highly
accurate reads and fails on error-prone SMRT reads
 Assumption: the de Bruijn approach is limited to short, accurate
reads, and OLC (overlap-layout-consensus) is the only way to assemble long error-prone reads
 The original version of the de Bruijn approach is far from
optimal for the genome assembly problem

de Bruijn Graph Demonstration
 de Bruijn graph DB(Str, k) of a string Str:
- Path(Str, k): a path of |Str| - k + 1 edges, where
the i-th edge is the i-th k-mer in Str and
the i-th vertex is the i-th (k-1)-mer in Str
- Glue identical vertices in Path(Str, k)
 A circular string,
Str = CATCAGATAGGA
- 3-mers: CAT, ATC, TCA, CAG, ...
- For the edge CAT, CA and AT are the
constituent vertices
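
The construction above can be sketched in a few lines of Python. This is a minimal illustration, not code from any assembler: the function name `de_bruijn` and the `circular` flag are assumptions made for this example. Gluing identical vertices happens implicitly, because equal (k-1)-mers share one dictionary key.

```python
from collections import defaultdict

def de_bruijn(text, k, circular=False):
    """Return {(k-1)-mer: [successor (k-1)-mers]}; every k-mer contributes one edge."""
    if circular:
        # Append the first k-1 symbols so k-mers spanning the end are included.
        text = text + text[:k - 1]
    graph = defaultdict(list)
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        # Edge from the k-mer's prefix to its suffix; equal vertices are glued
        # because they map to the same dictionary key.
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

graph = de_bruijn("CATCAGATAGGA", 3, circular=True)
for vertex in sorted(graph):
    print(vertex, "->", ", ".join(graph[vertex]))
```

For the circular string on this slide, the 12 edges (3-mers) are spread over 8 glued vertices; for example, CA has outgoing edges to AT (from CAT) and AG (from CAG).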

A-Bruijn Graph
 A variation of the de Bruijn graph approach
 A more general approach than de Bruijn
 Includes breakpoint graphs, a major area of genome
rearrangement studies

A-Bruijn Graph Demonstration
 An arbitrary substring-free set of strings, V (a set of solid strings)
V consists of words (of any length)
- Path(Str, V): a path through all words from V appearing in Str (in order)
- Assign the integer shift(v,w) to the edge (v,w) in this path to denote the
difference between the positions of v and w in Str
 Glue identically labeled vertices to construct the A-Bruijn graph AB(Str, V)
 AB(Str, V) is generalized to AB(Reads, V)
- A path for each read
- Glue all identical vertices in all paths
- An Eulerian path in AB(Reads, V) spells out the genome
 Selecting an appropriate set of solid strings is a crucial factor

A-Bruijn Graph Demonstration
 A circular string,
Str = CATCAGATAGGA
 Set of solid strings, V=
{ CA, AT, TC, AGA, TA, AGG, AC }
 Integer shift AGA → AT : 2
(AGA starts at position 5 and the next AT starts at position 7 of CATCAGATAGGA)
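
The example above can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: the function name `abruijn_path` and the 0-based positions are choices made here, and the string is treated as circular so that AC (spanning the end and the start) is found.

```python
from collections import defaultdict

def abruijn_path(text, solid, circular=True):
    """Return the edges (v, w, shift) of Path(Str, V) for a substring-free set V."""
    n = len(text)
    longest = max(len(w) for w in solid)
    # For a circular string, pad with the prefix so wrapping words are found.
    padded = text + text[:longest - 1] if circular else text
    # All occurrences of solid strings, ordered by start position (0-based).
    hits = sorted((i, w) for w in solid for i in range(n) if padded.startswith(w, i))
    edges = [(v, w, j - i) for (i, v), (j, w) in zip(hits, hits[1:])]
    if circular and hits:
        (i, v), (j, w) = hits[-1], hits[0]
        edges.append((v, w, j + n - i))  # closing edge around the circle
    return edges

solid = {"CA", "AT", "TC", "AGA", "TA", "AGG", "AC"}
edges = abruijn_path("CATCAGATAGGA", solid)

# Gluing identically labeled vertices: one adjacency list per label.
ab_graph = defaultdict(list)
for v, w, shift in edges:
    ab_graph[v].append((w, shift))

for v, w, shift in edges:
    print(f"{v} -> {w} (shift {shift})")
```

The edge AGA → AT carries shift 2, matching the slide, and the closing edge AC → CA carries shift 1.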

Solid String Selection
 Short Illumina reads and long SMRT reads differ in terms of their resultant
A-Bruijn graphs
 Short Illumina reads: the resultant graph can be analyzed further after applying graph
simplification procedures (bubble and tip removal)
- not applicable to long SMRT reads (with error rate > 10%)
 Good candidates for solid strings: k-mers that appear frequently in reads
- (k,t)-mer: a k-mer that appears at least t times
- for a typical bacterial SMRT assembly, k=15 and t=8 (default choice)
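
Selecting (k,t)-mers amounts to counting k-mers across all reads and keeping the frequent ones. The sketch below assumes a hypothetical helper `solid_kmers`, ignores reverse complements, and uses toy reads with small k and t; a real SMRT data set would use values such as the k=15, t=8 defaults quoted above.

```python
from collections import Counter

def solid_kmers(reads, k, t):
    """Return the set of (k,t)-mers: k-mers occurring at least t times in the reads."""
    counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, count in counts.items() if count >= t}

# Toy reads; the third one carries an error (GTT instead of GTA).
reads = ["ACGTAC", "CGTACG", "GTTCGT"]
frequent = solid_kmers(reads, k=3, t=2)
print(sorted(frequent))
```

k-mers introduced by the error (GTT, TTC, TCG) occur only once and are filtered out, which is the intuition behind using frequent k-mers as solid strings.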

Finding Genomic Path in A-Bruijn Graph
 hybridSPAdes Algorithm (for co-assembling short and long reads):
1. Constructing the assembly graph from short reads using SPAdes
2. Mapping long reads to the assembly graph and generating read-paths
3. Closing gaps in the assembly graph using the consensus of long reads that
span the gaps
4. Resolving repeats in the assembly graph by incorporating long read-paths
into the decision rule of EXSPANDER (a repeat resolution framework)

Finding Genomic Path in A-Bruijn Graph
 SPAdes Algorithm :
(1) Assembly graph construction: de Bruijn graph simplification
(2) k-bimer adjustment: accurate distance estimation between k-mers
in the genome
(3) Construction of the paired assembly graph: PDBG approach
(4) Contig construction: backtracking graph simplification
 hybridSPAdes vs longSPAdes:
hybridSPAdes: de Bruijn graph on k-mers from short reads
longSPAdes: A-Bruijn graph on (k,t)-mers from long reads

ABruijn Assembler
 Attempts to find a genomic path in the original A-Bruijn graph (instead of simplified one)
 In the context of A-Bruijn graph, it is difficult to decide whether two reads overlap or not
 Uses the parameters of longSPAdes in new contexts
 Introduces some additional parameters alongside those of longSPAdes

Matching reads against draft genome
 ABruijn uses BLASR to align all reads against the draft genome.
 It then combines the pairwise alignments of all reads into a
multiple alignment, Alignment.
 Since this alignment is inaccurate for an error-prone draft genome, it needs
to be modified.

Matching reads against draft genome
Our goal is to partition the multiple alignment of reads into
thousands of short segments
- called mini-alignments
and to error-correct each segment
- as error correction methods are fast for short segments
However, constructing mini-alignments is not simple

Defining solid regions in draft genome
Each column of Alignment is either a reference position or a non-reference position.

Cov(i) = total number of reads covering position i
Match(i) = number of reads that match the reference symbol in column i
Del(i) = number of space symbols in column i
Sub(i) = number of substituted symbols in column i
Ins(i) = number of non-space symbols in the non-reference column

Cov(i) = Match(i) + Del(i) + Sub(i)
Match rate = Match(i) / Cov(i)
Deletion rate = Del(i) / Cov(i)
Substitution rate = Sub(i) / Cov(i)
Insertion rate = Ins(i) / Cov(i)

 For a given l-mer,
- Local match rate = minimum match rate over its positions
- Local insertion rate = maximum insertion rate over its positions
 An l-mer is called (α, β)-solid if
- α < local match rate, and
- β ≥ local insertion rate

 Taking (α, β) = (0.8, 0.2)
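
The definitions above translate directly into a sliding-window check. The sketch below is an illustration only: the per-column counts are invented toy data, the tuple layout `(match, sub, del, ins)` and the function names are assumptions, and the insertion rate is taken as Ins(i) / Cov(i).

```python
def column_rates(match, sub, dele, ins):
    """Return (match rate, insertion rate) for one reference column."""
    cov = match + sub + dele          # Cov(i) = Match(i) + Del(i) + Sub(i)
    if cov == 0:
        return 0.0, float("inf")      # an uncovered column can never be solid
    return match / cov, ins / cov

def solid_lmers(columns, l, alpha=0.8, beta=0.2):
    """Return start indices of (alpha, beta)-solid l-mers."""
    rates = [column_rates(*col) for col in columns]
    starts = []
    for s in range(len(rates) - l + 1):
        window = rates[s:s + l]
        local_match = min(m for m, _ in window)   # minimum match rate
        local_ins = max(i for _, i in window)     # maximum insertion rate
        if local_match > alpha and local_ins <= beta:
            starts.append(s)
    return starts

# Toy counts per reference column: (Match, Sub, Del, Ins), coverage 10 each.
columns = [
    (9, 1, 0, 1), (9, 0, 1, 0), (10, 0, 0, 1), (9, 1, 0, 2),
    (6, 2, 2, 1), (9, 0, 1, 0), (9, 1, 0, 4), (10, 0, 0, 0),
]
print(solid_lmers(columns, l=4))
print(solid_lmers(columns, l=2))
```

In this toy data the fifth column has a low match rate and the seventh a high insertion rate, so only windows avoiding both are reported as solid.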

 A contiguous sequence of (α, β)-solid l-mers forms a solid
region.
 The goal now is to select a position (landmark) within each
solid region and to form mini-alignments from the segments of
reads between landmarks.

Breaking multiple alignment into mini-alignments
Another A-Bruijn graph with much simpler bubbles is
constructed using (α, β)-solid l-mers.
First, landmarks are selected outside homonucleotide
runs.

Selecting landmarks
 4-mer
- CAGT: gold (all its nucleotides are different)
- ATGA: simple (consecutive nucleotides are different)
 Landmarks: middle points (2nd and 3rd nucleotides)
 ABruijn analyzes each mini-alignment and error-corrects each
segment between consecutive landmarks.
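
The gold/simple distinction is easy to state in code. The following sketch uses hypothetical names (`classify_4mer`, `landmark_candidates`) and an assumed convention that a landmark is reported as the 0-based index of the 3rd nucleotide of the 4-mer, i.e. the boundary between its 2nd and 3rd nucleotides.

```python
def classify_4mer(kmer):
    """Return 'gold', 'simple', or None for a 4-mer."""
    if len(set(kmer)) == len(kmer):
        return "gold"        # all nucleotides are different
    if all(a != b for a, b in zip(kmer, kmer[1:])):
        return "simple"      # only consecutive nucleotides are different
    return None              # contains a homonucleotide run

def landmark_candidates(draft):
    """Yield (landmark index, kind) for every gold or simple 4-mer in the draft."""
    for start in range(len(draft) - 3):
        kind = classify_4mer(draft[start:start + 4])
        if kind is not None:
            yield start + 2, kind   # middle point of the 4-mer

print(classify_4mer("CAGT"), classify_4mer("ATGA"), classify_4mer("AATG"))
print(list(landmark_candidates("CAGTT")))
```

Both kinds of 4-mer contain no repeated adjacent nucleotide, which keeps landmarks outside homonucleotide runs.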

Constructing the A-Bruijn graph on solid regions in the draft genome
 Each solid region containing a landmark is labeled by its landmark position, and
each read is broken into a sequence of segments.
 Each read is represented as a directed path through the vertices.

To construct the A-Bruijn graph AB(Alignment), all
identically labeled vertices are glued together.

 The edges between two consecutive landmarks form a
necklace.
 If a necklace is long (exceeds 100 bp), ABruijn
shortens it by increasing the number of necklaces.

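Probabilistic model for necklace polishing
A necklace contains read segments
- Segments = {seg_1, seg_2, ..., seg_m}
Find a consensus sequence that maximizes
$\Pr(\mathit{Segments} \mid \mathit{Consensus}) = \prod_{i=1}^{m} \Pr(\mathit{seg}_i \mid \mathit{Consensus})$
where $\Pr(\mathit{seg}_i \mid \mathit{Consensus})$ = product of the match, mismatch,
insertion, and deletion rates over all positions of the alignment of seg_i against the consensus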

Probabilistic model for necklace polishing
 Start from the initial necklace sequence
 Iteratively check whether a mutation exists that increases
Pr(Segments | Consensus)
 Select the mutation that yields the maximum increase
 Iterate until convergence
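
The greedy search above can be sketched as follows. This is a simplified illustration, not ABruijn's implementation: the error rates in `RATES` are invented placeholder values, `log_pr` scores only the single best alignment of a segment against the consensus (a Viterbi-style approximation), and all function names are assumptions.

```python
import math

# Illustrative per-position rates (assumed values, not taken from the slides).
RATES = {"match": 0.85, "sub": 0.05, "ins": 0.05, "del": 0.05}
LOG = {name: math.log(p) for name, p in RATES.items()}

def log_pr(seg, cons):
    """Log-probability of the best alignment of a read segment to the consensus."""
    prev = [j * LOG["del"] for j in range(len(cons) + 1)]
    for i in range(1, len(seg) + 1):
        cur = [i * LOG["ins"]]
        for j in range(1, len(cons) + 1):
            step = LOG["match"] if seg[i - 1] == cons[j - 1] else LOG["sub"]
            cur.append(max(prev[j - 1] + step,       # match or substitution
                           prev[j] + LOG["ins"],     # extra symbol in the read
                           cur[j - 1] + LOG["del"])) # consensus symbol missing
        prev = cur
    return prev[-1]

def total(segments, cons):
    """log Pr(Segments | Consensus) as a sum over independent segments."""
    return sum(log_pr(seg, cons) for seg in segments)

def mutations(cons):
    """All single-nucleotide insertions, substitutions, and deletions."""
    for i in range(len(cons) + 1):
        for c in "ACGT":
            yield cons[:i] + c + cons[i:]
            if i < len(cons) and c != cons[i]:
                yield cons[:i] + c + cons[i + 1:]
        if i < len(cons):
            yield cons[:i] + cons[i + 1:]

def polish(initial, segments):
    """Greedily apply the best-scoring mutation until none improves the score."""
    best, best_score = initial, total(segments, initial)
    while True:
        cand = max(mutations(best), key=lambda c: total(segments, c))
        score = total(segments, cand)
        if score <= best_score + 1e-12:
            return best
        best, best_score = cand, score

segments = ["ACGT", "ACGT", "ACT", "ACGT"]
polished = polish("ACT", segments)
print(polished)
```

Starting from the initial sequence ACT, the first iteration inserts the missing G because three of the four segments support it; no further mutation increases the score, so the search stops.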

Error-correcting Homonucleotide runs
 The performance of the probabilistic approach deteriorates
when it estimates the lengths of homonucleotide runs.
 Thus a homonucleotide likelihood function is introduced,
based on the statistics of homonucleotide runs.

 To generate the statistics, an arbitrary set of reads aligned against a reference genome is needed.
 Each aligned segment is represented simply as the set of its
nucleotide counts.
- For example, AATTACA = 4A1C2T.
 After processing all runs in the reference genome, the statistics for all
read segments are obtained.

 The frequencies are used to compute the likelihood
function as the product of these frequencies over all reads.
 To decide on the length of a homonucleotide run, the length
that maximizes the likelihood function is selected.

For example, Segments = {5A, 6A, 6A, 7A, 6A1C}
- Pr(Segments|6A) = 0.155 × 0.473^2 × 0.1 × 0.02 ≈ 0.00007
- Pr(Segments|7A) = 0.049 × 0.154^2 × 0.418 × 0.022 ≈ 0.00001
 So, select AAAAAA over AAAAAAA as the necklace
consensus.
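
The worked example can be checked with a few lines of Python. The frequency table `FREQ` contains only the values quoted in the example above; the names `count_signature` and `likelihood` are assumptions made for this sketch, and a real table would cover many more run lengths and segment signatures.

```python
from collections import Counter
from math import prod

def count_signature(segment):
    """Represent an aligned segment by its nucleotide counts, e.g. AATTACA -> 4A1C2T."""
    counts = Counter(segment)
    return "".join(f"{counts[base]}{base}" for base in sorted(counts))

# Pr(observed signature | true run), restricted to the values in the example.
FREQ = {
    "6A": {"5A": 0.155, "6A": 0.473, "7A": 0.1, "6A1C": 0.02},
    "7A": {"5A": 0.049, "6A": 0.154, "7A": 0.418, "6A1C": 0.022},
}

def likelihood(segments, run):
    """Product of the observed-signature frequencies over all read segments."""
    return prod(FREQ[run][sig] for sig in segments)

segments = ["5A", "6A", "6A", "7A", "6A1C"]
for run in FREQ:
    print(run, f"{likelihood(segments, run):.2e}")

best_run = max(FREQ, key=lambda run: likelihood(segments, run))
print("selected run:", best_run)
```

The likelihood for 6A (about 6.9e-05) exceeds the one for 7A (about 1.1e-05), so the run of six A's is selected.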

Benchmarking
 ABruijn and PBcR were benchmarked against the
reference E. coli K12 genome.
 ABruijn and PBcR differ from the E. coli K12 reference genome in
2906 and 2925 positions, respectively.
 Both agree on 2871 of these positions
- suggesting errors occurred.

Benchmarking
The analysis focuses on the remaining positions

Benchmarking
 ABruijn was also used to assemble the ECOLInano dataset.
 The assembler described in Loman et al. and ABruijn each assembled
the ECOLInano dataset into a single circular contig, with error
rates of 1.5% and 1.1%, respectively.

Potential Scope of Development
Calculate likelihood ratio of multiple solid string sets

Calculate likelihood ratio of multiple solid string sets
Building a probability model
 Derive solid string sets for similar genomes with known sequences
 Apply the A-Bruijn approach to find the solution
 Find the set that leads to the approximately best solution

Calculate likelihood ratio of multiple solid string sets
Building a probability model
 Derive a relation between the optimal set and the long read sequence
 Apply this relation to an unknown genome sequence of a similar type
to assign the probabilistic value
Applying Bridging Effect

Applying Bridging Effect
With long reads, the k-mer length is larger,
so it is difficult to detect the correct branch

Applying Bridging Effect
Apply the short read process before branching
Integrate the result with the long read sequence to detect the correct branching

Walk on the Combined Sequence

Merge Walking
Apply both the short read and the long read approach to the reads of a known genome sequence
(Figure: result from short read process; result from long read process)

Merge Walking
Find the potentially overlapping sequence
(Figure: sequence from long read process; sequence from short read process; overlapping area)

Merge Walking
Build multiple solution sets combining both results
Each solution in the set must contain the overlapped portion
(Figure: result from short read process; result from long read process)

Merge Walking
Compare each solution with the known genome sequence
Form a secondary solution set that contains the similar optimal solutions

Merge Walking
Align these solutions to the results of both the short read and the long read approach
Detect the overlapped sequence
Find the characteristics of the related overlapped sequence

Merge Walking
For an unknown similar genome sequence, apply the obtained
characteristics to form a solution combining both results
