Sequence Assembly & MaSuRCA as a Hybrid Approach
ZEHADY ABDULLAH KHAN
PhD 1st Year Student,
Department Of Computer Science,
Purdue University
DNA Sequencing
The process of determining the precise order of
nucleotides in a DNA molecule.
Overlap Layout Consensus
● Compute all pairwise overlaps between reads.
● Create a layout
o An alignment of all overlapping reads.
● Extract a consensus sequence
o By scanning the multi-read alignment, column by column.
● Examples: Celera Assembler (Miller et al., 2008; Myers et al., 2000), PCAP (Huang, 2003), Arachne
(Batzoglou et al., 2002) and Phusion (Mullikin and Ning, 2003).
● Benefits:
o Flexibility with respect to read lengths
o Robustness to sequencing errors.
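● Problem:
o Computationally expensive: the all-pairs overlap step grows quadratically with the number of reads, and the layout step amounts to a Hamiltonian-path problem, for which only exponential-time exact algorithms are known.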
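The overlap step can be sketched as follows. This is a minimal illustration, not the algorithm of any assembler named above: it assumes exact suffix-prefix matches, ignores reverse complements and sequencing errors, and the names `suffix_prefix_overlap`, `all_pairwise_overlaps` and `min_len` are hypothetical.

```python
from itertools import permutations


def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def all_pairwise_overlaps(reads, min_len=3):
    """Compare every ordered pair of reads (quadratic in the number of reads)."""
    overlaps = {}
    for i, j in permutations(range(len(reads)), 2):
        n = suffix_prefix_overlap(reads[i], reads[j], min_len)
        if n:
            overlaps[(i, j)] = n
    return overlaps


# Toy reads (made up for illustration).
reads = ["ACGTAC", "TACGGA", "GGATTC"]
overlaps = all_pairwise_overlaps(reads)
print(overlaps)  # {(0, 1): 3, (1, 2): 3}
```

The double loop over ordered pairs is what makes this step expensive on large read sets.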
The De Bruijn Graph
● Examples: Allpaths-LG (Gnerre et al., 2010), SOAPdenovo (Li et al., 2008), Velvet (Zerbino and Birney,
2008), EULER-SR (Chaisson and Pevzner, 2008) and ABySS (Simpson et al., 2009)
● Any path through the graph that visits every edge exactly once, formally known as an Eulerian
path, forms a draft assembly of the reads.
o This holds when all reads are perfect, so that the graph built from the reads matches the de Bruijn graph of the genome.
● In practice, reads are not perfect.
● Real graphs are complex, with many intersecting cycles and many alternative Eulerian paths.
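A toy version of the idea can be sketched as follows, assuming error-free reads, one edge per distinct k-mer, and a graph that actually has an Eulerian path (the code does not check this). The function names are hypothetical, and the path search is the standard Hierholzer algorithm rather than the method of any particular assembler.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer contributes one edge."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    nodes = set(graph) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in sorted(nodes) if len(graph.get(n, [])) - in_deg.get(n, 0) == 1),
        min(nodes),
    )
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Turn a path of (k-1)-mers back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGCG", "TGCGT"]  # toy, error-free reads
assembly = spell(eulerian_path(de_bruijn_graph(reads, k=3)))
print(assembly)  # ATGCGT
```

In this example every (k-1)-mer occurs once, so the Eulerian path is unique. With repeats the graph has cycles and several Eulerian paths may exist.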
What is MaSuRCA?
● A new hybrid method
o Computational efficiency of the de Bruijn graph method.
o Flexibility of the OLC assembly
Super Read
● Extend each original read forwards and backwards, base by base, as long as the extension is
unique.
● k-mer count look-up table
o An efficient hash table
o Determine quickly how many times each k-mer occurs in our reads
● Given a k-mer found at the end of a read, there are four candidates for the next k-mer.
o The strings formed by appending A, C, G or T to the last k-1 bases in the read
● If only one of the four possible k-mers occurs, we say the read has a unique following k-mer
and we append that base to the read.
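The extension rule above can be sketched as follows. This is a simplified illustration under stated assumptions: only forward extension is shown (backward extension is symmetric), reverse complements and error filtering are ignored, a Python `Counter` stands in for the efficient hash table, and `max_steps` is a hypothetical guard against looping forever on a cycle.

```python
from collections import Counter


def count_kmers(reads, k):
    """k-mer count look-up table: how many times each k-mer occurs in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend_forward(read, counts, k, max_steps=1000):
    """Append bases while exactly one of the four candidate k-mers occurs (k >= 2)."""
    extended = read
    for _ in range(max_steps):
        suffix = extended[-(k - 1):]  # last k-1 bases of the read so far
        candidates = [suffix + base for base in "ACGT" if counts.get(suffix + base, 0) > 0]
        if len(candidates) != 1:
            break  # no extension, or the extension is not unique
        extended += candidates[0][-1]
    return extended


reads = ["ACGTT", "GTTCA", "TCAGG"]  # toy reads
counts = count_kmers(reads, k=3)
super_read = extend_forward("ACGTT", counts, k=3)
print(super_read)  # ACGTTCAGG
```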
k-Unitig
● A k-mer is called simple if it has a unique preceding k-mer and a unique following k-mer.
● A k-unitig is a string of maximal length such that every k-mer in it is simple except for the first
and the last.
● By construction, no k-mer can belong to more than one k-unitig.
● If a read has a k-mer that occurs in a k-unitig, the read and the k-unitig can be aligned to one
another.
● Use individual reads to merge the k-unitigs that overlap them into a single longer super-read.
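The definition of a simple k-mer can be checked directly against a set of observed k-mers. The sketch below only classifies k-mers and does not build the k-unitigs themselves. It assumes a single strand (no reverse complements), and the helper names are hypothetical.

```python
def successors(kmer, kmers):
    """Observed k-mers that can follow `kmer` (overlap of k-1 bases)."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in kmers]


def predecessors(kmer, kmers):
    """Observed k-mers that can precede `kmer` (overlap of k-1 bases)."""
    return [b + kmer[:-1] for b in "ACGT" if b + kmer[:-1] in kmers]


def is_simple(kmer, kmers):
    """Simple means exactly one preceding and exactly one following k-mer."""
    return len(successors(kmer, kmers)) == 1 and len(predecessors(kmer, kmers)) == 1


def kmer_set(reads, k):
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


reads = ["GGACGTA", "ACGTC"]  # toy reads that diverge after "CGT"
kmers = kmer_set(reads, k=3)
simple = sorted(km for km in kmers if is_simple(km, kmers))
print(simple)  # ['ACG', 'GAC']
```

Here `CGT` is not simple because it has two possible following k-mers (`GTA` and `GTC`), so a k-unitig passing through `GAC` and `ACG` would end at `CGT`.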
Super Reads from paired-reads
● If the reads are paired,
o We examine each pair of reads
o Map each read to the k-unitigs,
o Look for a unique path of k-unitigs connected by k-unitig overlaps that connects the two
reads.
o If we find such a path, we merge the k-unitigs along it and extend both paired-end reads into a single new super-read.
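The search for a unique connecting path can be sketched as a bounded depth-first search over a k-unitig overlap graph. This is an illustration only: the graph representation, the names `find_paths` and `unique_path`, and the `max_depth` limit are assumptions, and constraints such as the expected insert size of the pair are not modelled.

```python
def find_paths(graph, start, goal, max_paths=2, max_depth=50):
    """Enumerate simple paths from start to goal, stopping after max_paths."""
    paths = []

    def dfs(node, path):
        if len(paths) >= max_paths or len(path) > max_depth:
            return
        if node == goal:
            paths.append(list(path))
            return
        for nxt in graph.get(node, []):
            if nxt not in path:  # keep the path simple (no revisits)
                path.append(nxt)
                dfs(nxt, path)
                path.pop()

    dfs(start, [start])
    return paths


def unique_path(graph, start, goal):
    """Return the path if exactly one exists, otherwise None."""
    paths = find_paths(graph, start, goal)
    return paths[0] if len(paths) == 1 else None


# Hypothetical overlap graphs between k-unitigs u1..u5.
unambiguous = {"u1": ["u2"], "u2": ["u3", "u4"], "u3": ["u5"], "u4": []}
ambiguous = {"u1": ["u2", "u3"], "u2": ["u4"], "u3": ["u4"]}

print(unique_path(unambiguous, "u1", "u5"))  # ['u1', 'u2', 'u3', 'u5']
print(unique_path(ambiguous, "u1", "u4"))    # None
```

Stopping at two paths is enough, since the only question is whether the path is unique.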
Assembler in MaSuRCA
● Modified version of the CABOG assembler
● Only maximal super-reads are used
o Those that are not exact substrings of another super-read.
● Use of other data in assembly
o Jumping libraries
o 454 read data
o Sanger read data
o Mate pairs
● Coverage of the genome by maximal super-reads typically ranges from 2x to 3x
o Independent of the raw read coverage
● MaSuRCA automatically chooses the k-mer size for creating super-reads.
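Selecting the maximal super-reads mentioned above can be sketched as a substring filter. This is a naive quadratic version for illustration, assuming a single strand (reverse complements are not considered). The function name is hypothetical and this is not MaSuRCA's actual implementation.

```python
def maximal_super_reads(super_reads):
    """Keep only super-reads that are not exact substrings of another super-read."""
    # Longest first, so every possible container is examined before its substrings.
    unique = sorted(set(super_reads), key=lambda s: (-len(s), s))
    maximal = []
    for s in unique:
        if not any(s in longer for longer in maximal):
            maximal.append(s)
    return maximal


super_reads = ["ACGTTCAGG", "CGTTC", "TTCAGG", "GGATT", "CGTTC"]  # toy input
print(maximal_super_reads(super_reads))  # ['ACGTTCAGG', 'GGATT']
```

Checking only against the already accepted super-reads is sufficient, because any non-maximal container is itself a substring of a maximal one.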
Results: Choice of Genome
● Reference paper:
The MaSuRCA genome assembler
Aleksey V. Zimin, Guillaume Marçais, Daniela Puiu, Michael Roberts, Steven L. Salzberg and James A. Yorke
● Bacterium R. sphaeroides str. 2.4.1 (Rhodobacter)
● Chromosome 16 of M. musculus lineage B6 (mouse)
Metrics of Evaluation
● N50:
o The length for which the collection of all contigs of that length or longer contains at least
half of the sum of the lengths of all contigs.
● NGA50:
o The value N such that 50% of the finished sequence is contained in contigs whose
alignments to the finished sequence are of size N or larger.
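Both metrics reduce to the same cumulative-sum computation with a different denominator. The sketch below assumes the aligned block lengths for NGA50 have already been obtained from an alignment to the finished sequence, which is not shown. The function names and the numbers are made up for illustration.

```python
def length_at_half(lengths, total):
    """Largest length L such that pieces of length >= L sum to at least half of `total`."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return None  # the pieces never cover half of `total`


def n50(contig_lengths):
    """N50: the denominator is the sum of all contig lengths."""
    return length_at_half(contig_lengths, sum(contig_lengths))


def nga50(aligned_block_lengths, reference_length):
    """NGA50: aligned block lengths measured against the finished sequence length."""
    return length_at_half(aligned_block_lengths, reference_length)


contigs = [80, 70, 50, 40, 30, 20]  # toy lengths, total 290
print(n50(contigs))                 # 70  (80 + 70 = 150 >= 145)
print(nga50(contigs, 400))          # 50  (80 + 70 + 50 = 200 >= 200)
```

NGA50 returns `None` here whenever the aligned blocks cover less than half of the reference.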
Results
The End


Editor's Notes

  1. Half of the finished sequence is contained in contigs whose alignments to the finished sequence