de Bruijn Graph
Construction from
Combination of
Short and Long Reads
CSE 6406 : Bioinformatics Algorithms
Course Faculty: Dr. Atif Hasan Rahman
Group Members
KAZI LUTFUL KABIR (1015052067)
SIKDER TAHSIN AL-AMIN (1015052076)
MD MAHABUR RAHMAN (1015052016)
Outline
 Common Terminology
 Motivation
 de Bruijn Graph
 A-Bruijn Graph
 Finding Genomic Path
 Error Correction in Draft Genome
 Potential Scopes of Development
Common Terminology
 Read: the sequence obtained from a cluster at the end of the sequencing process; ultimately the
sequence of a section of a unique fragment
 Contig: a set of reads related to each other by overlap of their sequences
 Genomic Path: a path in the assembly graph that corresponds to traversing the genome
 Draft genome: a genomic DNA sequence with lower accuracy than a finished sequence
- some segments are missing or in the wrong order or orientation
 Tip: a dead-end branch caused by a sequencing error, where the graph ends prematurely; it contains
both correct and incorrect k-mers
 Bubble: a branch caused by a sequencing error in which the erroneous k-mers form a path that
reconnects with the main graph
Limitations of Classical de Bruijn Graph
 Imperfect coverage of genome by reads (the classical construction assumes that
every k-mer from the genome is represented by a read)
 Reads are error-prone
 Multiplicities of k-mers are unknown
 Distances between reads within the read-pairs are inexact
Motivation
 Implicit assumption: the de Bruijn approach is inapplicable to long-read assembly
 Misunderstanding: the de Bruijn graph can only assemble highly
accurate reads and fails on error-prone SMRT reads
 Assumption: the de Bruijn approach is limited to short, accurate
reads, and OLC (overlap-layout-consensus) is the only way to assemble long error-prone reads
 The original version of the de Bruijn approach is far from
optimal for the genome assembly problem
de Bruijn Graph Demonstration
 de Bruijn graph DB(Str, k) of a string Str:
- Path(Str, k): a path of |Str| - k + 1 edges, where the
i-th edge is the i-th k-mer in Str and the
i-th vertex is the i-th (k-1)-mer in Str
- Glue identical vertices in Path(Str, k)
 A circular string,
Str = CATCAGATAGGA
- 3-mers: CAT, ATC, TCA, CAG, ...
- For the edge CAT, CA and AT are the
constituent vertices
de Bruijn Graph Construction
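The construction above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function names `circular_kmers` and `de_bruijn` are hypothetical, and the string is treated as circular, so it yields |Str| k-mers rather than the |Str| - k + 1 of a linear string. Gluing identical vertices is done implicitly by using the (k-1)-mers as dictionary keys.

```python
from collections import defaultdict


def circular_kmers(s, k):
    """Return the len(s) k-mers of the circular string s, in order."""
    doubled = s + s[:k - 1]  # wrap around the end of the circle
    return [doubled[i:i + k] for i in range(len(s))]


def de_bruijn(s, k):
    """Build DB(s, k): vertices are (k-1)-mers, one edge per k-mer.

    Identical vertices are glued because each (k-1)-mer is a single key.
    """
    graph = defaultdict(list)
    for kmer in circular_kmers(s, k):
        graph[kmer[:-1]].append(kmer[1:])  # edge: prefix -> suffix
    return dict(graph)


graph = de_bruijn("CATCAGATAGGA", 3)
for vertex in sorted(graph):
    print(vertex, "->", graph[vertex])
```

For the edge CAT this produces the vertex CA with an outgoing edge to AT, matching the example on the slide.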
A-Bruijn Graph
 A variation of de Bruijn graph approach
 More general approach than de Bruijn
 Include breakpoint graphs- a major arena of genome
rearrangement study
A-Bruijn Graph Demonstration
 An arbitrary substring-free set of strings, V (a set of solid strings)
V consists of words (of any length)
-Path(Str, V ) : a path through all words from V appearing in Str (in order)
-Assign integer shift(v,w) to the edge (v,w) in this path to denote the
difference between the positions of v and w in Str
 Glue identically labeled vertices as to construct the A-Bruijn graph AB(Str, V)
 AB(Str, V) is generalized to AB(Reads, V)
- A path for each read
- Glue all identical vertices in all paths
- An Eulerian path in AB(Reads,V) spells out the genome
 Selecting an appropriate set of solid strings : a crucial factor
A-Bruijn Graph Demonstration
 A circular string,
Str = CATCAGATAGGA
 Set of solid strings, V=
{ CA, AT, TC, AGA, TA, AGG, AC }
 Integer shift AGA → AT: 2 (AGA starts at position 5 and the next AT at position 7 of CATCAGATAGGA)
A-Bruijn Graph Construction
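A small sketch of AB(Str, V) for the circular example, assuming 0-based positions and hypothetical helper names (`word_occurrences`, `a_bruijn`). It walks through the occurrences of solid strings in order, records the shift on each edge, and glues vertices by using the word itself as the vertex label.

```python
def word_occurrences(s, words):
    """Sorted (position, word) pairs for every occurrence of a word from
    `words` in the circular string s (0-based positions)."""
    doubled = s + s[:max(len(w) for w in words) - 1]
    hits = []
    for i in range(len(s)):
        for w in words:
            if doubled.startswith(w, i):
                hits.append((i, w))
    return sorted(hits)


def a_bruijn(s, words):
    """Build AB(s, words) as a dict mapping an edge (v, w) to its shifts."""
    hits = word_occurrences(s, words)
    edges = {}
    for idx, (pos, v) in enumerate(hits):
        next_pos, w = hits[(idx + 1) % len(hits)]  # circular successor
        shift = (next_pos - pos) % len(s) or len(s)
        edges.setdefault((v, w), []).append(shift)
    return edges


V = {"CA", "AT", "TC", "AGA", "TA", "AGG", "AC"}
edges = a_bruijn("CATCAGATAGGA", V)
for (v, w), shifts in sorted(edges.items()):
    print(f"{v} -> {w}: shift {shifts}")
```

The edge AGA → AT carries shift 2, as on the slide. Because V is substring-free, at most one word starts at any position, so the order of the path is well defined.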
Solid String Selection
 Short Illumina reads and long SMRT reads differ in terms of their resultant
A-Bruijn graph
 Short Illumina read: resultant graph can be analyzed further after application of graph
simplification procedures (bubble and tip removal)
- not applicable for long SMRT reads (with error rate > 10%)
 Good Candidate for solid string: k-mers that appear frequently in reads
- (k,t)-mer : k-mer that has appeared at least t times
- for a typical bacterial SMRT assembly, k=15 and t=8 (default choice)
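As a rough illustration of (k,t)-mer selection, the sketch below counts k-mer occurrences across a set of reads and keeps those seen at least t times. The function name `solid_kmers` and the toy reads are assumptions for this example; it counts raw occurrences and ignores reverse complements, which a real assembler would need to handle.

```python
from collections import Counter


def solid_kmers(reads, k, t):
    """Return the set of (k,t)-mers: k-mers occurring at least t times."""
    counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, count in counts.items() if count >= t}


# Toy data: three consistent reads and one read with an error (CGA..).
reads = ["ACGTAC", "CGTACG", "GTACGT", "ACGAAC"]
print(sorted(solid_kmers(reads, k=3, t=2)))
```

The k-mers introduced by the erroneous read appear only once and are filtered out, which is the intuition behind choosing frequent k-mers as solid strings.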
Finding Genomic Path in A-Bruijn Graph
 hybridSPAdes Algorithm (for co-assembling short and long reads):
1. Constructing the assembly graph from short reads using SPAdes
2. Mapping long reads to the assembly graph and generating read-paths
3. Closing gaps in the assembly graph using the consensus of long reads that
span the gaps
4. Resolving repeats in the assembly graph by incorporating long read-paths
into the decision rule of EXSPANDER (a repeat resolution framework)
Finding Genomic Path in A-Bruijn Graph
 SPAdes Algorithm :
(1) Assembly graph construction: de Bruijn graph simplification
(2) k-bimer adjustment: accurate distance estimation between k-mers
in the genome
(3) Construction of the paired assembly graph: PDBG approach
(4) Contig construction: backtracking graph simplification
 hybridSPAdes vs longSPAdes:
hybrid: de Bruijn graph on k-mers from short reads
long: A-Bruijn graph on (k,t)-mers from long reads
ABruijn Assembler
 Attempts to find a genomic path in the original A-Bruijn graph (instead of simplified one)
 In the context of A-Bruijn graph, it is difficult to decide whether two reads overlap or not
 Parameters of longSPAdes in new contexts
 Some additional parameters along with those of longSPAdes
Matching reads against draft genome
 ABruijn uses BLASR to align all reads against draft genome.
 It further combines pairwise alignments of all reads into a
multiple alignment, Alignment.
 Since this is inaccurate for error-prone draft genome, we need
to modify it.
Matching reads against draft genome
Our goal is to partition the multiple alignment of reads into
thousands of short segments
- Called mini-alignments
and to error-correct each segment.
- Error correction methods are fast for short segments
However, constructing mini-alignments is not simple.
Defining solid regions in draft genome
[Figure: multiple alignment with reference positions and non-reference positions marked]
For column i of the multiple alignment:
Cov(i) = total number of reads covering position i
Match(i) = number of reads that match the reference in column i
Del(i) = number of space symbols in column i
Sub(i) = number of substituted symbols in column i
Ins(i) = number of non-space symbols in the non-reference columns associated with position i
Cov(i) = Match(i) + Del(i) + Sub(i)
Match rate = Match(i) / Cov(i)
Deletion rate = Del(i) / Cov(i)
Substitution rate = Sub(i) / Cov(i)
Insertion rate = Ins(i) / Cov(i)
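The per-column statistics can be computed as follows. This is a sketch under assumptions: a column is represented by its reference base plus the list of symbols the covering reads place there ('-' for a space), the insertion count is passed in separately, and the insertion rate is taken to be Ins(i) / Cov(i) by analogy with the other rates. The function name `column_rates` is hypothetical.

```python
def column_rates(ref_base, read_symbols, inserted=0):
    """Rates for one reference column of the multiple alignment.

    read_symbols: symbols from reads covering this column ('-' = space).
    inserted: non-space symbols in the adjacent non-reference columns
              (assumed representation of Ins(i)).
    """
    cov = len(read_symbols)
    if cov == 0:
        raise ValueError("column is not covered by any read")
    match = sum(1 for s in read_symbols if s == ref_base)
    deleted = sum(1 for s in read_symbols if s == "-")
    sub = cov - match - deleted  # Cov = Match + Del + Sub
    return {
        "cov": cov,
        "match": match / cov,
        "del": deleted / cov,
        "sub": sub / cov,
        "ins": inserted / cov,
    }


rates = column_rates("A", list("AAAAC-AAAA"), inserted=2)
print(rates)
```

In this toy column, 8 of 10 reads match, one has a substitution and one a deletion, so the three rates sum to 1 as the identity for Cov(i) requires.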
Defining solid regions in draft genome
 For a given l-mer,
- Local match rate = minimum match rate over its positions
- Local insertion rate = maximum insertion rate over its positions
 An l-mer is called (α, β)-solid if
- its local match rate exceeds α, and
- its local insertion rate does not exceed β
Defining solid regions in draft genome
 Taking (α, β) = (0.8,0.2)
Defining solid regions in draft genome
 The contiguous sequence of (α, β)-solid l-mers forms a solid
region.
 The goal now is to select a position (landmark) within each
solid region and to form mini-alignments from the segments of
reads.
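The definitions of (α, β)-solid l-mers and solid regions can be sketched as below, assuming the per-position match and insertion rates are already available as lists. The function names and the toy rates are illustrative only; regions are returned as half-open (begin, end) intervals over draft-genome positions.

```python
def solid_lmer_starts(match_rates, ins_rates, l, alpha, beta):
    """Start positions of (alpha, beta)-solid l-mers."""
    starts = []
    for i in range(len(match_rates) - l + 1):
        local_match = min(match_rates[i:i + l])  # minimum match rate
        local_ins = max(ins_rates[i:i + l])      # maximum insertion rate
        if local_match > alpha and local_ins <= beta:
            starts.append(i)
    return starts


def solid_regions(starts, l):
    """Merge contiguous solid l-mers into half-open (begin, end) regions."""
    regions = []
    for s in starts:
        if regions and s == regions[-1][1] - l + 1:
            regions[-1][1] = s + l   # extends the current region by one
        else:
            regions.append([s, s + l])
    return [tuple(r) for r in regions]


match = [0.9, 0.95, 0.9, 0.85, 0.6, 0.9, 0.9, 0.9, 0.9]
ins = [0.1, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1, 0.1, 0.1]
starts = solid_lmer_starts(match, ins, l=3, alpha=0.8, beta=0.2)
print(starts, solid_regions(starts, l=3))
```

With (α, β) = (0.8, 0.2), the low match rate at position 4 and the high insertion rate at position 5 split the toy draft into two solid regions.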
Breaking multiple alignment into mini-
alignments
Another A-Bruijn graph with much simpler bubbles is
constructed using (α, β)-solid l-mers.
First landmarks are selected outside homonucleotide
runs.
Selecting landmarks
 4-mer
- CAGT: gold (all its nucleotides are different)
- ATGA: simple (consecutive nucleotides are different)
 Landmarks: the middle points (2nd and 3rd nucleotides) of such 4-mers
 ABruijn analyzes each mini-alignment and error corrects each
segment between consecutive landmarks.
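A possible way to classify 4-mers and pick a landmark inside a solid region is sketched below. The rule of preferring the leftmost gold 4-mer and falling back to the leftmost simple one is an assumption made for this example, as are the function names; the landmark is reported as the 0-based positions of the 2nd and 3rd nucleotides.

```python
def classify_4mer(kmer):
    """Return 'gold', 'simple', or None for a 4-mer."""
    if len(kmer) != 4:
        raise ValueError("expected a 4-mer")
    if len(set(kmer)) == 4:
        return "gold"    # all nucleotides different
    if all(a != b for a, b in zip(kmer, kmer[1:])):
        return "simple"  # consecutive nucleotides different
    return None          # contains a homonucleotide run


def pick_landmark(region):
    """Positions of the middle nucleotides of the chosen 4-mer, or None.

    Assumed preference: leftmost gold 4-mer, else leftmost simple 4-mer.
    """
    for wanted in ("gold", "simple"):
        for i in range(len(region) - 3):
            if classify_4mer(region[i:i + 4]) == wanted:
                return (i + 1, i + 2)
    return None


print(classify_4mer("CAGT"), classify_4mer("ATGA"), classify_4mer("AATG"))
print(pick_landmark("AACAGTT"))
```

Both classes exclude adjacent repeated nucleotides, which keeps landmarks outside homonucleotide runs.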
Constructing the A-Bruijn graph on solid
regions in the draft genome
 Each solid region containing a landmark is labeled by its landmark position, and
each read is broken into a sequence of segments.
 Each read is represented as a directed path through the vertices.
Constructing the A-Bruijn graph on solid
regions in the draft genome
To construct the A-Bruijn graph AB(Alignment), all
identically labeled vertices are glued together.
Constructing the A-Bruijn graph on solid
regions in the draft genome
 The edges between two consecutive landmarks form a
necklace.
 If a necklace is long (exceeds 100 bp), ABruijn reduces its length by
increasing the number of necklaces.
Probabilistic model for necklace polishing
A necklace contains read-segments
- Segments = {seg_1, seg_2, ..., seg_m}
Find a consensus sequence that maximizes
$\Pr(\mathrm{Segments} \mid \mathrm{Consensus}) = \prod_{i=1}^{m} \Pr(seg_i \mid \mathrm{Consensus})$
where Pr(seg_i | Consensus) is the product of the match, mismatch,
insertion, and deletion rates over all positions
Probabilistic model for necklace polishing
 Start from initial necklace sequence
 Iteratively check whether a mutation exists that increases
Pr(Segments | Consensus)
 Select the mutation that gives the maximum increase
 Iterate until convergence
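The iterative procedure above can be sketched as a greedy search. This is a simplified illustration, not ABruijn's implementation: the four rates are hypothetical values, Pr(seg | Consensus) is approximated by the best-scoring global alignment (computed with dynamic programming in log space), and the candidate mutations are all single-base insertions, substitutions, and deletions.

```python
import math

# Hypothetical per-position rates (assumptions, not values from the slides).
LOG = {name: math.log(p) for name, p in
       {"match": 0.85, "mismatch": 0.05, "ins": 0.05, "del": 0.05}.items()}


def log_pr(seg, cons):
    """Log-probability of the best alignment of seg against cons."""
    prev = [j * LOG["del"] for j in range(len(cons) + 1)]
    for i in range(1, len(seg) + 1):
        cur = [i * LOG["ins"]]
        for j in range(1, len(cons) + 1):
            kind = "match" if seg[i - 1] == cons[j - 1] else "mismatch"
            cur.append(max(prev[j - 1] + LOG[kind],   # aligned pair
                           prev[j] + LOG["ins"],      # extra base in read
                           cur[j - 1] + LOG["del"]))  # base missing in read
        prev = cur
    return prev[-1]


def total_log_pr(segments, cons):
    """log Pr(Segments | Consensus) as a sum over segments."""
    return sum(log_pr(seg, cons) for seg in segments)


def mutations(cons, alphabet="ACGT"):
    """All single-base insertions, substitutions, and deletions."""
    for i in range(len(cons) + 1):
        for base in alphabet:
            yield cons[:i] + base + cons[i:]
            if i < len(cons) and base != cons[i]:
                yield cons[:i] + base + cons[i + 1:]
        if i < len(cons):
            yield cons[:i] + cons[i + 1:]


def polish(segments, cons):
    """Apply the best-improving mutation until none improves the score."""
    best = total_log_pr(segments, cons)
    while True:
        cand = max(mutations(cons), key=lambda c: total_log_pr(segments, c))
        score = total_log_pr(segments, cand)
        if score <= best:
            return cons
        cons, best = cand, score


segments = ["ACGT", "ACGT", "ACT", "ACGT", "AGGT"]
print(polish(segments, "ACT"))
```

Working in log space turns the product over segments into a sum and avoids numerical underflow for long necklaces.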
Error-correcting Homonucleotide runs
 The performance of the probabilistic approach deteriorates
when it estimates the lengths of homonucleotide runs.
 Thus a homonucleotide likelihood function is introduced
based on the statistics of homonucleotide runs.
Error-correcting Homonucleotide runs
 To generate the statistics, an arbitrary set of reads is needed.
 Each aligned segment is represented simply as the set of its
nucleotide counts.
- For example, AATTACA = 4A1C2T.
 After processing all runs in the reference genome, the statistics for all
read segments are obtained.
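The nucleotide-count representation is easy to compute. A minimal sketch, with the hypothetical function name `count_signature` and nucleotides listed in alphabetical order to match the 4A1C2T example:

```python
from collections import Counter


def count_signature(segment):
    """Represent a read segment by its nucleotide counts, e.g. '4A1C2T'."""
    counts = Counter(segment)
    return "".join(f"{counts[base]}{base}" for base in sorted(counts))


print(count_signature("AATTACA"))
print(count_signature("AAAAAA"))
```

The representation discards the order of nucleotides, so segments that differ only in where an error falls within a run share one signature.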
Error-correcting Homonucleotide runs
 The frequencies are used for computing the likelihood
function as the product of these frequencies for all reads.
 To decide on the length of a homonucleotide run, the length
of the run that maximizes the likelihood function is selected.
Error-correcting Homonucleotide runs
For example, Segments = {5A, 6A, 6A, 7A, 6A1C}
- Pr(Segments|6A) = 0.155 × 0.473^2 × 0.1 × 0.02 ≈ 0.00007
- Pr(Segments|7A) = 0.049 × 0.154^2 × 0.418 × 0.022 ≈ 0.00001
 So, select AAAAAA over AAAAAAA as the necklace
consensus.
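The comparison can be reproduced with a short script. The frequencies are the ones quoted in the example; the table name `FREQ` and the function `likelihood` are hypothetical, and in practice the table would come from the homonucleotide-run statistics rather than being hard-coded.

```python
from math import prod

# Frequency of observing each read segment given the true run (from the example).
FREQ = {
    "6A": {"5A": 0.155, "6A": 0.473, "7A": 0.1, "6A1C": 0.02},
    "7A": {"5A": 0.049, "6A": 0.154, "7A": 0.418, "6A1C": 0.022},
}

segments = ["5A", "6A", "6A", "7A", "6A1C"]


def likelihood(run, segments):
    """Product of the observation frequencies over all read segments."""
    return prod(FREQ[run][seg] for seg in segments)


for run in FREQ:
    print(run, f"{likelihood(run, segments):.7f}")

best = max(FREQ, key=lambda run: likelihood(run, segments))
print("selected run:", best)
```

The product for 6A evaluates to about 0.00007 and the one for 7A to about 0.00001, so the run of six A's is selected.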
Benchmarking
 Performed benchmarking of ABruijn and PBcR against the
reference E. coli K12 genome.
 ABruijn and PBcR differ from the E. coli K12 reference genome at
2906 and 2925 positions, respectively.
 Both agree on 2871 of these positions,
- suggesting errors occurred.
Benchmarking
The remaining positions are the focus.
Benchmarking
 ABruijn was also used to assemble the ECOLInano dataset.
 The assembler described in Loman et al. and ABruijn each assembled
the ECOLInano dataset into a single circular contig, with error
rates of 1.5% and 1.1%, respectively.
Potential Scope of Development
Calculate Likelihood Ratio of
multiple solid string sets
Calculate likelihood ratio of multiple
solid string sets
Building a probability model
 Derive solid string sets for known genome sequences of similar
genomes
 Apply the A-Bruijn approach to find the solution
 Find the set that leads to the approximately best solution
Calculate likelihood ratio of multiple
solid string sets
Building a probability model
 Derive a relation between the optimal set and the long-read
sequence
 Apply this relation to an unknown genome sequence of a similar type
to assign the probabilistic value
Potential Scope of Development
Applying Bridging Effect
Applying Bridging Effect
For long reads, the k-mer length
is larger.
This makes it difficult to detect the correct branch.
Applying Bridging Effect
Apply the short-read process before
branching.
Integrate the result with the long-read
sequence to detect the correct
branch.
Potential Scope of Development
Walk on the Combined
Sequence
Merge Walking
Apply both Short Read & Long Read
Approach on Known Genome Read
Sequence
Result from Short Read Process
Result from Long Read Process
Merge Walking
Find the potentially overlapping
sequence
Sequence from Long
Read Process
Sequence from Short
Read Process
Overlapping area
Merge Walking
Build multiple Solution Set
combining both result
Each Solution in the Set must
contain the overlapped portion
Result from Short Read Process
Result from Long Read Process
Merge Walking
Compare each solution with the
known genome sequence
Form a secondary solution set
that contains the similar (optimal)
solutions
Merge Walking
Align these solutions to the results of both the short-read
and long-read approaches
Detect the overlapped sequence
Find the characteristics of the related
overlapped sequence
Merge Walking
For an unknown similar genome
sequence, apply the obtained
characteristics to form a solution
combining both results
Thank you

2024 DevOps Pro Europe - Growing at the edge
 
Scaling in conventional MOSFET for constant electric field and constant voltage
Scaling in conventional MOSFET for constant electric field and constant voltageScaling in conventional MOSFET for constant electric field and constant voltage
Scaling in conventional MOSFET for constant electric field and constant voltage
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdfA CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
A CASE STUDY ON ONLINE TICKET BOOKING SYSTEM PROJECT.pdf
 
NO1 Pandit Amil Baba In Bahawalpur, Sargodha, Sialkot, Sheikhupura, Rahim Yar...
NO1 Pandit Amil Baba In Bahawalpur, Sargodha, Sialkot, Sheikhupura, Rahim Yar...NO1 Pandit Amil Baba In Bahawalpur, Sargodha, Sialkot, Sheikhupura, Rahim Yar...
NO1 Pandit Amil Baba In Bahawalpur, Sargodha, Sialkot, Sheikhupura, Rahim Yar...
 
Event Management System Vb Net Project Report.pdf
Event Management System Vb Net  Project Report.pdfEvent Management System Vb Net  Project Report.pdf
Event Management System Vb Net Project Report.pdf
 
fluid mechanics gate notes . gate all pyqs answer
fluid mechanics gate notes . gate all pyqs answerfluid mechanics gate notes . gate all pyqs answer
fluid mechanics gate notes . gate all pyqs answer
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
 
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data StreamKIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
KIT-601 Lecture Notes-UNIT-3.pdf Mining Data Stream
 
Cloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptx
Cloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptxCloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptx
Cloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptx
 
Construction method of steel structure space frame .pptx
Construction method of steel structure space frame .pptxConstruction method of steel structure space frame .pptx
Construction method of steel structure space frame .pptx
 
Automobile Management System Project Report.pdf
Automobile Management System Project Report.pdfAutomobile Management System Project Report.pdf
Automobile Management System Project Report.pdf
 
İTÜ CAD and Reverse Engineering Workshop
İTÜ CAD and Reverse Engineering WorkshopİTÜ CAD and Reverse Engineering Workshop
İTÜ CAD and Reverse Engineering Workshop
 
Toll tax management system project report..pdf
Toll tax management system project report..pdfToll tax management system project report..pdf
Toll tax management system project report..pdf
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generation
 
RS Khurmi Machine Design Clutch and Brake Exercise Numerical Solutions
RS Khurmi Machine Design Clutch and Brake Exercise Numerical SolutionsRS Khurmi Machine Design Clutch and Brake Exercise Numerical Solutions
RS Khurmi Machine Design Clutch and Brake Exercise Numerical Solutions
 
Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.
 

de Bruijn Graph Construction from Combination of Short and Long Reads

  • 1. de Bruijn Graph Construction from Combination of Short and Long Reads. CSE 6406: Bioinformatics Algorithms. Course Faculty: Dr. Atif Hasan Rahman
  • 2. Group Members: KAZI LUTFUL KABIR (1015052067), SIKDER TAHSIN AL-AMIN (1015052076), MD MAHABUR RAHMAN (1015052016)
  • 3. Outline
      - Common Terminology
      - Motivation
      - de Bruijn Graph
      - A-Bruijn Graph
      - Finding Genomic Path
      - Error Correction in Draft Genome
      - Potential Scopes of Development
  • 4. Common Terminology
      - Read: the sequence of a cluster obtained at the end of the sequencing process; ultimately, the sequence of a section of a unique fragment
      - Contig: a set of reads related to each other by overlaps in their sequences
      - Genomic path: a path in the assembly graph that corresponds to traversing the genome
      - Draft genome: a genomic DNA sequence with lower accuracy than a finished sequence; some segments are missing or in the wrong order or orientation
      - Tip: a branch, caused by a sequencing error, at which the graph ends prematurely; it contains both correct and incorrect k-mers
      - Bubble: a branch, caused by a sequencing error, whose path of k-mers reconnects with the main graph
  • 5. Limitations of Classical de Bruijn Graph
      - Imperfect coverage of the genome by reads (the classical approach assumes that every k-mer from the genome is represented by a read)
      - Reads are error-prone
      - Multiplicities of k-mers are unknown
      - Distances between reads within read-pairs are inexact
  • 6. Motivation
      - Implicit assumption: the de Bruijn approach is inapplicable to long-read assembly
      - Misunderstanding: the de Bruijn graph can only assemble highly accurate reads and fails on error-prone SMRT reads
      - Assumption: the de Bruijn approach is limited to short, accurate reads, and OLC (overlap-layout-consensus) is the only way to assemble long, error-prone reads
      - The original version of the de Bruijn approach is far from optimal for the genome assembly problem
  • 7. de Bruijn Graph Demonstration
      - de Bruijn graph DB(Str, k) of a string Str:
          - Path(Str, k): a path of |Str| - k + 1 edges, where the i-th edge is the i-th k-mer in Str and the i-th vertex is the i-th (k-1)-mer in Str
          - Glue identical vertices in Path(Str, k)
      - A circular string, Str = CATCAGATAGGA
          - 3-mers: CAT, ATC, TCA, CAG, ...
          - For the edge CAT, CA and AT are the constituent vertices
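As a rough illustration of the construction on this slide, the following Python sketch builds DB(Str, 3) for the circular string CATCAGATAGGA. The function name `de_bruijn_circular` and the adjacency-list representation are choices made here for illustration and do not come from the slides. Gluing identical vertices happens implicitly, because each (k-1)-mer is used as a dictionary key.

```python
from collections import defaultdict


def de_bruijn_circular(s, k):
    """Build the de Bruijn graph of a circular string.

    Each k-mer is an edge from its (k-1)-mer prefix to its (k-1)-mer suffix.
    Identical (k-1)-mers share one dictionary key, which is the "gluing" step.
    """
    n = len(s)
    doubled = s + s[:k - 1]  # lets k-mers wrap around the end of the circle
    graph = defaultdict(list)
    for i in range(n):
        kmer = doubled[i:i + k]
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


graph = de_bruijn_circular("CATCAGATAGGA", 3)
for vertex, successors in graph.items():
    print(vertex, "->", successors)
```

Because the string is circular, every position starts a k-mer, so the graph has as many edges as the string has characters.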
  • 8. de Bruijn Graph Construction
  • 9. A-Bruijn Graph
      - A variation of the de Bruijn graph approach
      - A more general approach than de Bruijn
      - Includes breakpoint graphs, a mainstay of genome rearrangement studies
  • 10. A-Bruijn Graph Demonstration
      - V: an arbitrary substring-free set of strings (a set of solid strings); V consists of words of any length
          - Path(Str, V): a path through all words from V appearing in Str (in order)
          - Assign the integer shift(v, w) to the edge (v, w) in this path to denote the difference between the positions of v and w in Str
      - Glue identically labeled vertices to construct the A-Bruijn graph AB(Str, V)
      - AB(Str, V) is generalized to AB(Reads, V)
          - A path for each read
          - Glue all identical vertices in all paths
          - An Eulerian path in AB(Reads, V) spells out the genome
      - Selecting an appropriate set of solid strings is a crucial factor
  • 11. A-Bruijn Graph Demonstration
      - A circular string, Str = CATCAGATAGGA
      - Set of solid strings, V = {CA, AT, TC, AGA, TA, AGG, AC}
      - Integer shift AGA → AT: 2
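The path and shift values for this example can be reproduced with a short Python sketch. The helper names `word_occurrences` and `path_edges` are hypothetical, and the code assumes V is substring-free, so at most one word starts at any position.

```python
def word_occurrences(s, words):
    """List (position, word) for every occurrence of a word in the circular
    string s, ordered by position. Assumes `words` is substring-free."""
    doubled = s + s  # lets occurrences wrap around the end of the circle
    hits = []
    for pos in range(len(s)):
        for w in words:
            if doubled.startswith(w, pos):
                hits.append((pos, w))
    return hits


def path_edges(s, words):
    """Return (v, w, shift) for consecutive words along Path(Str, V)."""
    hits = word_occurrences(s, words)
    n = len(s)
    edges = []
    # Pair each occurrence with the next one, closing the circle at the end.
    for (p, v), (q, w) in zip(hits, hits[1:] + hits[:1]):
        edges.append((v, w, (q - p) % n))
    return edges


V = ["CA", "AT", "TC", "AGA", "TA", "AGG", "AC"]
edges = path_edges("CATCAGATAGGA", V)
for v, w, shift in edges:
    print(f"{v} -> {w}: shift {shift}")
```

Gluing vertices with the same label would then merge, for instance, the two occurrences of CA and the two occurrences of AT into single vertices.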
  • 13. Solid String Selection
      - Short Illumina reads and long SMRT reads differ in their resulting A-Bruijn graphs
      - Short Illumina reads: the resulting graph can be analyzed further after applying graph simplification procedures (bubble and tip removal); this is not applicable to long SMRT reads (with error rate > 10%)
      - Good candidates for solid strings: k-mers that appear frequently in reads
          - (k,t)-mer: a k-mer that appears at least t times
          - For a typical bacterial SMRT assembly, k = 15 and t = 8 (default choice)
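A minimal sketch of (k,t)-mer selection is given below. The function name `solid_kmers` and the toy reads are assumptions made for illustration, and the sketch ignores reverse complements, which a real assembler would need to handle. Small values of k and t are used so the toy example stays readable.

```python
from collections import Counter


def solid_kmers(reads, k, t):
    """Return the set of k-mers that appear at least t times across all reads."""
    counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, count in counts.items() if count >= t}


# Toy data: three consistent reads and one read with an error (T -> A).
reads = ["ACGTAC", "CGTACG", "GTACGT", "ACGAAC"]
solid = solid_kmers(reads, k=3, t=3)
print(sorted(solid))
```

The k-mers created by the erroneous read occur only once and are filtered out, which is the intuition behind choosing frequent k-mers as solid strings.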
  • 14. Finding Genomic Path in A-Bruijn Graph
      - hybridSPAdes algorithm (for co-assembling short and long reads):
          1. Construct the assembly graph from short reads using SPAdes
          2. Map long reads to the assembly graph and generate read-paths
          3. Close gaps in the assembly graph using the consensus of long reads that span the gaps
          4. Resolve repeats in the assembly graph by incorporating long read-paths into the decision rule of EXSPANDER (a repeat resolution framework)
  • 15. Finding Genomic Path in A-Bruijn Graph
      - SPAdes algorithm:
          1. Assembly graph construction: de Bruijn graph simplification
          2. k-bimer adjustment: accurate distance estimation between k-mers in the genome
          3. Construction of the paired assembly graph: PDBG approach
          4. Contig construction: backtracking graph simplification
      - hybridSPAdes vs. longSPAdes:
          - hybridSPAdes: de Bruijn graph on k-mers from short reads
          - longSPAdes: A-Bruijn graph on (k,t)-mers from long reads
  • 16. ABruijn Assembler
      - Attempts to find a genomic path in the original A-Bruijn graph (instead of the simplified one)
      - In the context of the A-Bruijn graph, it is difficult to decide whether two reads overlap
      - Uses the parameters of longSPAdes in new contexts
      - Adds some parameters to those of longSPAdes
  • 17. Matching reads against draft genome
      - ABruijn uses BLASR to align all reads against the draft genome
      - It then combines the pairwise alignments of all reads into a multiple alignment, Alignment
      - Since this alignment is inaccurate for an error-prone draft genome, it needs to be modified
  • 18. Matching reads against draft genome
      - The goal is to partition the multiple alignment of reads into thousands of short segments, called mini-alignments, and to error-correct each segment, since error correction methods are fast for short segments
      - However, constructing mini-alignments is not simple
  • 19. Defining solid regions in draft genome (figure labels: non-reference position, reference position)
  • 20. Defining solid regions in draft genome: Cov(i) = total number of reads covering position i
  • 21. Defining solid regions in draft genome: Match(i) = number of reads that match the reference in column i
  • 22. Defining solid regions in draft genome: Del(i) = number of space symbols in column i
  • 23. Defining solid regions in draft genome: Sub(i) = number of substituted symbols in column i
  • 24. Defining solid regions in draft genome: Ins(i) = number of non-space symbols in a non-reference column
  • 25. Defining solid regions in draft genome
      - Cov(i) = Match(i) + Del(i) + Sub(i)
      - Match rate = Match(i) / Cov(i)
      - Deletion rate = Del(i) / Cov(i)
      - Substitution rate = Sub(i) / Cov(i)
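To make these per-column statistics concrete, here is a small sketch that computes them for one reference column of a multiple alignment. The function name `column_rates`, the use of "-" as the space symbol, and the toy column are assumptions for illustration only.

```python
def column_rates(ref_symbol, read_symbols, space="-"):
    """Compute coverage and match/deletion/substitution rates for one
    reference column, given the symbols that the covering reads place there."""
    cov = len(read_symbols)
    match = sum(1 for c in read_symbols if c == ref_symbol)
    deleted = sum(1 for c in read_symbols if c == space)
    sub = cov - match - deleted  # so that Cov = Match + Del + Sub
    return {
        "cov": cov,
        "match_rate": match / cov,
        "del_rate": deleted / cov,
        "sub_rate": sub / cov,
    }


# Ten reads cover a reference 'A': eight match, one has a space, one has 'C'.
rates = column_rates("A", "AAAA-CAAAA")
print(rates)
```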
  • 26. Defining solid regions in draft genome
      - For a given l-mer:
          - Local match rate = minimum match rate over its positions
          - Local insertion rate = maximum insertion rate over its positions
      - An l-mer is called (α, β)-solid if:
          - α < local match rate, and
          - β >= local insertion rate
  • 27. Defining solid regions in draft genome: taking (α, β) = (0.8, 0.2)
  • 28. Defining solid regions in draft genome
      - A contiguous sequence of (α, β)-solid l-mers forms a solid region
      - The goal now is to select a position (landmark) within each solid region and to form mini-alignments from the segments of reads
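The definition of (α, β)-solid l-mers and solid regions can be sketched as follows. The per-position rates are made-up toy values, the function names are hypothetical, and the sketch assumes for simplicity that one insertion rate is attached to each reference position.

```python
def solid_lmer_starts(match_rate, ins_rate, l, alpha=0.8, beta=0.2):
    """Start positions of (alpha, beta)-solid l-mers."""
    starts = []
    for i in range(len(match_rate) - l + 1):
        local_match = min(match_rate[i:i + l])  # worst match rate in the l-mer
        local_ins = max(ins_rate[i:i + l])      # worst insertion rate in the l-mer
        if local_match > alpha and local_ins <= beta:
            starts.append(i)
    return starts


def solid_regions(starts, l):
    """Merge consecutive solid l-mers into (first, last) position ranges."""
    runs = []
    for s in starts:
        if runs and s == runs[-1][1] + 1:
            runs[-1][1] = s          # extend the current run of l-mer starts
        else:
            runs.append([s, s])      # begin a new run
    return [(first, last + l - 1) for first, last in runs]


match_rate = [0.9, 0.95, 0.9, 0.85, 0.6, 0.9, 0.9, 0.9, 0.9]
ins_rate = [0.1, 0.1, 0.1, 0.1, 0.1, 0.3, 0.1, 0.1, 0.1]
starts = solid_lmer_starts(match_rate, ins_rate, l=3)
regions = solid_regions(starts, l=3)
print(starts, regions)
```

In this toy input the low match rate at position 4 and the high insertion rate at position 5 split the sequence into two solid regions.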
  • 29. Breaking multiple alignment into mini-alignments
      - Another A-Bruijn graph with much simpler bubbles is constructed using (α, β)-solid l-mers
      - First, landmarks are selected outside homonucleotide runs
  • 30. Selecting landmarks
      - 4-mers:
          - CAGT: gold (all its nucleotides are different)
          - ATGA: simple (consecutive nucleotides are different)
      - Landmarks: the middle points (2nd and 3rd nucleotides)
      - ABruijn analyzes each mini-alignment and error-corrects each segment between consecutive landmarks
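The gold and simple categories can be checked with a few lines of Python. The function name `classify_4mer` and the label "neither" for 4-mers that fall into no category are assumptions made for this sketch.

```python
def classify_4mer(kmer):
    """Classify a 4-mer as 'gold', 'simple', or 'neither'."""
    if len(set(kmer)) == len(kmer):
        return "gold"    # all nucleotides are different
    if all(a != b for a, b in zip(kmer, kmer[1:])):
        return "simple"  # every pair of consecutive nucleotides differs
    return "neither"     # contains a homonucleotide run of length >= 2


for kmer in ["CAGT", "ATGA", "AATG"]:
    print(kmer, classify_4mer(kmer))
```

A 4-mer such as AATG contains a homonucleotide run, which is why it would not be a candidate under the rule that landmarks lie outside such runs.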
  • 31. Constructing the A-Bruijn graph on solid regions in the draft genome
      - Each solid region containing a landmark is labeled by its landmark position, and each read is broken into a sequence of segments
      - Each read is represented as a directed path through the vertices
  • 32. Constructing the A-Bruijn graph on solid regions in the draft genome
      - To construct the A-Bruijn graph AB(Alignment), all identically labeled vertices are glued together
  • 33. Constructing the A-Bruijn graph on solid regions in the draft genome
      - The edges between two consecutive landmarks form a necklace
      - If a necklace is long (exceeds 100 bp), ABruijn reduces its length by increasing the number of necklaces
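One way to picture the gluing step and the resulting necklaces is the sketch below. The input format (a list of landmark positions per read, plus the read segments between consecutive landmarks) and the name `build_necklaces` are hypothetical; the slides do not specify a data layout.

```python
from collections import defaultdict


def build_necklaces(read_paths):
    """Group read segments by the pair of consecutive landmarks they connect.

    read_paths: list of (landmarks, segments), where segments[i] is the part
    of the read between landmarks[i] and landmarks[i + 1].
    Using landmark positions as keys glues identically labeled vertices.
    """
    necklaces = defaultdict(list)
    for landmarks, segments in read_paths:
        for u, v, seg in zip(landmarks, landmarks[1:], segments):
            necklaces[(u, v)].append(seg)
    return dict(necklaces)


read_paths = [
    ([10, 25, 40], ["ACGTT", "GGAC"]),
    ([10, 25], ["ACGT"]),
    ([25, 40], ["GGAC"]),
]
necklaces = build_necklaces(read_paths)
for edge, segs in necklaces.items():
    print(edge, segs)
```

Each dictionary entry then holds the parallel edges of one necklace, ready for consensus polishing.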
  • 34. Probabilistic model for necklace polishing
      - A necklace contains read segments: Segments = {seg_1, seg_2, ..., seg_m}
      - Find a consensus sequence that maximizes $\Pr(\text{Segments} \mid \text{Consensus}) = \prod_{i=1}^{m} \Pr(\text{seg}_i \mid \text{Consensus})$
      - Here $\Pr(\text{seg}_i \mid \text{Consensus})$ is the product of the match, mismatch, insertion, and deletion rates over all positions
  • 35. Probabilistic model for necklace polishing
      - Start from the initial necklace sequence
      - Iteratively check whether a mutation exists that increases $\Pr(\text{Segments} \mid \text{Consensus})$
      - Select the mutation that yields the maximum increase
      - Iterate until convergence
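The greedy procedure on these two slides can be sketched in Python under simplifying assumptions. The error rates below are invented for illustration, Pr(seg | Consensus) is approximated by the single best alignment rather than any model from the slides, and all function names are hypothetical.

```python
import math

# Assumed (hypothetical) per-position rates: match, mismatch, insertion, deletion.
LOG_MATCH, LOG_SUB, LOG_INS, LOG_DEL = (math.log(p) for p in (0.85, 0.05, 0.05, 0.05))


def log_likelihood(seg, consensus):
    """Log-probability of the best alignment of seg against consensus."""
    prev = [j * LOG_INS for j in range(len(seg) + 1)]
    for c in consensus:
        cur = [prev[0] + LOG_DEL]
        for j, s in enumerate(seg, start=1):
            diag = prev[j - 1] + (LOG_MATCH if s == c else LOG_SUB)
            cur.append(max(diag, prev[j] + LOG_DEL, cur[j - 1] + LOG_INS))
        prev = cur
    return prev[-1]


def total_log_likelihood(segments, consensus):
    """Sum of logs, i.e. the log of the product over all segments."""
    return sum(log_likelihood(seg, consensus) for seg in segments)


def mutations(consensus, alphabet="ACGT"):
    """All single-base insertions, deletions, and substitutions."""
    for i in range(len(consensus) + 1):
        for b in alphabet:
            yield consensus[:i] + b + consensus[i:]
        if i < len(consensus):
            yield consensus[:i] + consensus[i + 1:]
            for b in alphabet:
                if b != consensus[i]:
                    yield consensus[:i] + b + consensus[i + 1:]


def polish(segments, initial):
    """Apply the best single mutation repeatedly until none improves the score."""
    best, best_score = initial, total_log_likelihood(segments, initial)
    while True:
        cand = max(mutations(best), key=lambda c: total_log_likelihood(segments, c))
        score = total_log_likelihood(segments, cand)
        if score <= best_score + 1e-12:
            return best
        best, best_score = cand, score


segments = ["ACGTA", "ACGTA", "ACTTA", "ACGA", "ACGTTA"]
polished = polish(segments, "ACTTA")
print(polished)
```

Working in log space turns the product over segments into a sum, which avoids numerical underflow when a necklace contains many segments.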
  • 36. Error-correcting Homonucleotide runs
      - The performance of the probabilistic approach deteriorates when it estimates the lengths of homonucleotide runs
      - A homonucleotide likelihood function is therefore introduced, based on the statistics of homonucleotide runs
  • 37. Error-correcting Homonucleotide runs
      - To generate the statistics, an arbitrary set of reads is needed
      - Each aligned segment is represented simply as the set of its nucleotide counts; for example, AATTACA = 4A1C2T
      - After this is done for all runs in the reference genome, the statistics for all read segments are obtained
  • 39. Error-correcting Homonucleotide runs
      - The frequencies are used to compute the likelihood function as the product of these frequencies over all reads
      - To decide on the length of a homonucleotide run, the run length that maximizes the likelihood function is selected
  • 40. Error-correcting Homonucleotide runs
      - For example, Segments = {5A, 6A, 6A, 7A, 6A1C}
          - Pr(Segments|6A) = 0.155 × 0.473^2 × 0.1 × 0.02 ≈ 0.00007
          - Pr(Segments|7A) = 0.049 × 0.154^2 × 0.418 × 0.022 ≈ 0.00001
      - So AAAAAA is selected over AAAAAAA as the necklace consensus
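The arithmetic in this example can be checked directly by multiplying the frequencies listed on the slide. Only the variable names below are introduced here; the numbers are copied from the example, in the order of the segments 5A, 6A, 6A, 7A, 6A1C.

```python
import math

# Frequency of observing each segment given the candidate run length,
# copied from the example: segments are 5A, 6A, 6A, 7A, 6A1C.
freqs_given_6A = [0.155, 0.473, 0.473, 0.1, 0.02]
freqs_given_7A = [0.049, 0.154, 0.154, 0.418, 0.022]

likelihood_6A = math.prod(freqs_given_6A)
likelihood_7A = math.prod(freqs_given_7A)

print(f"Pr(Segments|6A) = {likelihood_6A:.2e}")
print(f"Pr(Segments|7A) = {likelihood_7A:.2e}")
print("Selected run length:", 6 if likelihood_6A > likelihood_7A else 7)
```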
  • 41. Benchmarking
      - ABruijn and PBcR were benchmarked against the reference E. coli K12 genome
      - ABruijn and PBcR differ from the E. coli K12 reference genome in 2906 and 2925 positions, respectively
      - Both agree on 2871 of these positions, suggesting that errors occurred
  • 43. Benchmarking
      - ABruijn was also used to assemble the ECOLInano dataset
      - The assembler described in Loman et al. and ABruijn each assembled the ECOLInano dataset into a single circular contig, with error rates of 1.5% and 1.1%, respectively
  • 44. Potential Scope of Development: calculate the likelihood ratio of multiple solid string sets
  • 45. Calculate likelihood ratio of multiple solid string sets: building a probability model
      - Derive solid string sets for similar genomes with known sequences
      - Apply the A-Bruijn approach to find the solution
      - Find the set that leads to the approximately best solution
  • 46. Calculate likelihood ratio of multiple solid string sets: building a probability model
      - Derive a relation between the optimal set and the long-read sequence
      - Apply this relation to an unknown genome sequence of a similar type to assign the probabilistic value
  • 47. Potential Scope of Development: applying the bridging effect
  • 48. Applying Bridging Effect: for long reads the k-mer length is larger, which makes it difficult to detect the correct branch
  • 49. Applying Bridging Effect: apply the short-read process before branching, then integrate the result with the long-read sequence to detect the correct branch
  • 50. Potential Scope of Development: walk on the combined sequence
  • 51. Merge Walking: apply both the short-read and long-read approaches to the reads of a known genome sequence (figure labels: result from short-read process, result from long-read process)
  • 52. Merge Walking: find the potentially overlapping sequence (figure labels: sequence from long-read process, sequence from short-read process, overlapping area)
  • 53. Merge Walking: build multiple solution sets combining both results; each solution in a set must contain the overlapped portion (figure labels: result from short-read process, result from long-read process)
  • 54. Merge Walking: compare each solution with the known genome sequence, and form a secondary solution set containing the similar optimal solutions
  • 55. Merge Walking: align these solutions to the results of both the short-read and long-read approaches, detect the overlapped sequence, and find the characteristics of the related overlapped sequence
  • 56. Merge Walking: for an unknown, similar genome sequence, apply the obtained characteristics to form a solution combining both results