Genome assembly: An Introduction
Sebastian Schmeier
s.schmeier@gmail.com
http://sschmeier.com/bioinf-workshop/
2016
•  De novo genome assembly
•  The fragment assembly problem
•  Shortest superstring problem
•  Seven bridges of Königsberg
•  Assembly as a graph theoretical problem
•  We construct a de Bruijn graph
•  Underlying assumptions of genome assemblies
•  The process of generating a new genome sequence from NGS genome
sequence reads based on assembly algorithms
•  Assembly involves joining short sequence fragments together into long
pieces – contigs
Figure: genome → sequencing → reads
Each read is a vertex; an edge joins two overlapping reads. Laying the reads out by their overlaps gives the genome:
ATGCG
  GCGTG
    GTGGC
     TGGCA
ATGCGTGGCA (genome)
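The vertex and edge picture above can be sketched in Python. This is a minimal illustration, not the method of any particular assembler; the names `overlap` and `overlap_graph` and the `min_len` threshold are assumptions made for this example.

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Overlaps shorter than min_len are ignored and reported as 0.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def overlap_graph(reads, min_len=3):
    """Each read is a vertex; (a, b) is an edge if a's suffix overlaps b's prefix."""
    edges = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_len)
        if n > 0:
            edges[(a, b)] = n
    return edges

reads = ["ATGCG", "GCGTG", "GTGGC", "TGGCA"]
for (a, b), n in overlap_graph(reads).items():
    print(a, "->", b, "overlap", n)
```

With a minimum overlap of 3, only the three edges that chain the reads in genome order remain.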
•  Given: A set of reads (strings) {s1, s2, … , sn}
•  Do: Determine a large string s that “best explains” the reads
•  What do we mean by “best explains”?
•  What assumptions might we require?
•  Objective: Find a string s such that
•  all reads s1, s2, … , sn are substrings of s
•  s is as short as possible
•  Assumptions:
▪  Reads are 100% accurate
▪  Identical reads must come from the same location on the genome
▪  “best” = “simplest”
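The objective above can be approximated with a greedy merge. The sketch below is a heuristic, not an exact shortest-superstring solver; `overlap` and `greedy_superstring` are names made up for this illustration, and the code assumes error-free reads, as stated in the assumptions.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_superstring(reads):
    """Repeatedly merge the two strings with the largest overlap (heuristic)."""
    # Identical reads are assumed to come from the same location: keep one copy.
    strings = list(dict.fromkeys(reads))
    while len(strings) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left string, index of right string)
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j and overlap(a, b) > best[0]:
                    best = (overlap(a, b), i, j)
        n, i, j = best
        merged = strings[i] + strings[j][n:]
        strings = [s for k, s in enumerate(strings) if k not in (i, j)] + [merged]
    return strings[0]

reads = ["ATGCG", "GCGTG", "GTGGC", "TGGCA"]
s = greedy_superstring(reads)
print(s, all(r in s for r in reads))
```

Greedy merging is simple, but it is not guaranteed to return the shortest superstring for every input.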
Example: the reads ATGCG, GCGTG, GTGGC and TGGCA overlap and merge into the superstring ATGCGTGGCA
•  The assumption is that all substrings of the genome (of read length) are represented among the reads
•  Even modern sequencers that generate 100 nt reads do not capture all 100-mers present in the genome
•  Thus, people generally use k-mers of a certain length k
▪  Here we use 3-mers: each read is cut into all of its overlapping substrings of length 3
Read    3-mers
ATGCG   ATG, TGC, GCG
GCGTG   GCG, CGT, GTG
GTGGC   GTG, TGG, GGC
TGGCA   TGG, GGC, GCA
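Cutting reads into k-mers is a sliding window of width k. A minimal sketch (the helper name `kmers` is an assumption of this example):

```python
def kmers(read, k):
    """All overlapping substrings of length k, in order of appearance."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

reads = ["ATGCG", "GCGTG", "GTGGC", "TGGCA"]
all_kmers = [km for read in reads for km in kmers(read, 3)]

# dict.fromkeys drops duplicates while keeping first-seen order
unique_kmers = list(dict.fromkeys(all_kmers))

print(all_kmers)
print(unique_kmers)
```

A read of length L yields L - k + 1 k-mers, so each 5-letter read above contributes three 3-mers.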
•  Make the 3-mers unique: ATG, TGC, GCG, CGT, GTG, TGG, GGC, GCA
•  Each unique 3-mer is a vertex
•  Draw an edge from x to y where the suffix of x overlaps the prefix of y (here: the last two letters of x equal the first two letters of y)
•  Find a Hamiltonian path, that is, a path that visits every vertex exactly once
•  Record the first letter of each vertex, then all letters of the last vertex
•  Two Hamiltonian paths exist in this graph, so two candidate genomes contain all 3-mers: ATGCGTGGCA and ATGGCGTGCA
•  UNFORTUNATELY: The Hamiltonian path problem is very difficult to solve (NP-complete)
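For a graph this small, the Hamiltonian paths can be found by brute force. The sketch below tries every ordering of the eight vertices, which is only feasible for toy examples because the number of orderings grows factorially; `follows` and `hamiltonian_genomes` are names invented for this illustration.

```python
from itertools import permutations

kmers = ["ATG", "TGC", "GCG", "CGT", "GTG", "TGG", "GGC", "GCA"]

def follows(x, y):
    """True if the (k-1)-letter suffix of x equals the (k-1)-letter prefix of y."""
    return x[1:] == y[:-1]

def hamiltonian_genomes(vertices):
    """Brute force: try every vertex ordering and keep those that form a path."""
    genomes = []
    for order in permutations(vertices):
        if all(follows(a, b) for a, b in zip(order, order[1:])):
            # first letter of each vertex, then all letters of the last vertex
            genomes.append("".join(v[0] for v in order[:-1]) + order[-1])
    return genomes

print(hamiltonian_genomes(kmers))
```

Eight vertices already mean 40,320 orderings, which illustrates why this formulation does not scale to real genomes.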
•  In 1735 Leonhard Euler was presented with the following problem:
▪  Find a walk through the city of Königsberg that would cross each of its seven bridges once and only once
▪  He proved that a connected graph with undirected edges contains an Eulerian cycle exactly when every node in the graph has an even number of edges touching it.
▪  For the Königsberg Bridge graph, this is not the case because each of the four nodes has an odd number of edges touching it and so the desired stroll through the city does not exist.
https://en.wikipedia.org/wiki/Leonhard_Euler
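Euler's even-degree condition is easy to check by counting. In the sketch below the four land masses are given the placeholder labels A to D (an assumption of this example), and the seven bridges are listed as undirected edges.

```python
from collections import Counter

# The seven bridges as undirected edges between four land masses.
# The labels A-D are placeholders chosen for this sketch.
bridges = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]

degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

# An Eulerian cycle needs an even degree at every node (graph assumed connected).
has_eulerian_cycle = all(d % 2 == 0 for d in degree.values())
print(dict(degree), has_eulerian_cycle)
```

Every node has an odd degree (5, 3, 3, 3), so the check fails, in line with Euler's result.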
Figure: a directed graph with four vertices (A, B, C, D) connected by directed edges
•  The degree of a vertex: # of edges connected to it
•  outdegree: # of outgoing edges
•  indegree: # of incoming edges
•  degree(B)?
•  outdegree(B)?
•  indegree(D)?
•  The case of directed graphs is similar:
▪  A graph in which indegrees are equal to outdegrees for all nodes is called 'balanced'.
▪  Euler's theorem states that a connected directed graph has an Eulerian cycle if and only if it is balanced.
•  Mathematically/computationally, finding an Eulerian path is much easier than finding a Hamiltonian path
→  We need to reformulate our assembly problem
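Whether a directed graph is balanced can be checked by counting indegrees and outdegrees. The two example graphs below are hypothetical, not the graph drawn on the slide, and the sketch does not test connectivity, which Euler's theorem also requires.

```python
from collections import Counter

def degrees(edges):
    """Return (indegree, outdegree) counters for a list of directed edges (u, v)."""
    indeg, outdeg = Counter(), Counter()
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return indeg, outdeg

def is_balanced(edges):
    """True if indegree equals outdegree at every vertex."""
    indeg, outdeg = degrees(edges)
    nodes = set(indeg) | set(outdeg)
    return all(indeg[n] == outdeg[n] for n in nodes)

# Hypothetical example graphs, not the one from the slide.
cycle = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
path = [("A", "B"), ("B", "C"), ("C", "D")]
print(is_balanced(cycle), is_balanced(path))
```

The four-node cycle is balanced; the open path is not, because its first and last nodes each have one unmatched edge.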
We construct a de Bruijn graph:
▪  edges represent k-mers
▪  vertices correspond to (k-1)-mers
1.  Form a node for every distinct prefix or suffix of a k-mer
2.  Connect vertex x to vertex y with a directed edge if some k-mer (e.g., ATG) has prefix x (e.g., AT) and suffix y (e.g., TG), and label the edge with this k-mer.
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers: AT, TG, GG, GC, CA, CG, GT
Edges (labelled with their k-mer):
AT → TG (ATG)
TG → GG (TGG)
TG → GC (TGC)
GT → TG (GTG)
GG → GC (GGC)
GC → CA (GCA)
GC → CG (GCG)
CG → GT (CGT)
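The two construction steps map directly onto a few lines of Python. This is a minimal sketch using an adjacency list; the function name `de_bruijn` and the dictionary representation are choices made for this example.

```python
from collections import defaultdict

def de_bruijn(kmers):
    """Map each (k-1)-mer to the list of (k-1)-mers it points to (one edge per k-mer)."""
    graph = defaultdict(list)
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)  # edge prefix -> suffix, labelled by the k-mer
        graph[suffix]                 # touch the suffix so nodes without outgoing edges exist too
    return dict(graph)

kmers = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
for node, successors in de_bruijn(kmers).items():
    print(node, "->", successors)
```

Each edge can be relabelled with its k-mer as the source node plus the last letter of the target node.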
Can we find a DNA sequence containing all k-mers?
→  In a de Bruijn graph, can we find a path that visits every edge of the graph exactly once?
→  Eulerian path
•  a vertex v is semibalanced if |indegree(v) - outdegree(v)| = 1
•  a connected directed graph has an Eulerian path if and only if it contains at most two semibalanced vertices and all other vertices are balanced
•  Here AT (indegree 0, outdegree 1) and CA (indegree 1, outdegree 0) are the only semibalanced vertices, so an Eulerian path from AT to CA exists
•  Following an Eulerian path and reading off the edge labels spells a genome
•  Two Eulerian paths exist in this graph, giving two candidate genomes: ATGGCGTGCA and ATGCGTGGCA
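An Eulerian path can be found in time linear in the number of edges, for example with Hierholzer's algorithm. The sketch below is one possible implementation under the assumptions that the graph is connected and that an Eulerian path exists; `eulerian_path` is a name chosen for this example, and which of the two valid genomes it returns depends on the order in which edges are visited.

```python
from collections import defaultdict

def eulerian_path(kmers):
    """Spell a sequence that uses every k-mer (edge) exactly once (Hierholzer's algorithm)."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # outdegree - indegree per node
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
        balance[kmer[:-1]] += 1
        balance[kmer[1:]] -= 1
    # Start at the semibalanced node with one extra outgoing edge, if there is one.
    start = next((n for n, b in balance.items() if b == 1), kmers[0][:-1])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())         # no unused edges left at this node
    path.reverse()
    # First node in full, then the last letter of every following node.
    return path[0] + "".join(node[-1] for node in path[1:])

kmers = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
genome = eulerian_path(kmers)
print(genome, all(k in genome for k in kmers))
```

Unlike the brute-force Hamiltonian search, this visits each edge only once, which is why the de Bruijn formulation scales.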
•  Four hidden assumptions that do not hold for next-generation sequencing
We took for granted that:
1.  we can generate all k-mers present in the genome
2.  all k-mers are error free
3.  each k-mer appears at most once in the genome
4.  the genome consists of a single chromosome
•  Consider the first two assumptions: we can generate all k-mers present in the genome, and all k-mers are error free
•  For these reasons we do NOT choose the longest possible k-mer
•  The smaller the k-mer, the higher the probability that we see all k-mers
•  Errors: a single sequencing error in ATGGCGTGCA changes every k-mer that covers the erroneous base
▪  K=3: ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA → mostly unaffected k-mers (at most 3 of 8 change)
▪  K=10: ATGGCGTGCA → 100% affected k-mers
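The effect of k on error sensitivity can be checked with a short calculation. The sketch below counts how many k-mers cover one erroneous base; the error position (index 4) is a hypothetical choice for this example.

```python
def affected_fraction(seq, error_pos, k):
    """Fraction of k-mers that cover a single erroneous base at error_pos (0-based)."""
    total = len(seq) - k + 1
    hit = sum(1 for start in range(total) if start <= error_pos < start + k)
    return hit / total

genome = "ATGGCGTGCA"
for k in (3, 10):
    print(k, affected_fraction(genome, error_pos=4, k=k))
```

A single error touches at most k of the k-mers, so small k keeps most k-mers intact while k equal to the read length loses everything.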
Each k-mer appears at most once in the genome → repeats
•  This is most often not true
•  The number of times a k-mer occurs in the genome is known as its k-mer multiplicity
Genome: ATGCGTGCGTGCA
k-mers: ATG, GCA, TGC, TGC, TGC, GTG, GTG, GCG, GCG, CGT, CGT
Distinct (k-1)-mers: AT, TG, GC, CA, CG, GT
Distinct edges: AT → TG (ATG), TG → GC (TGC), GT → TG (GTG), GC → CG (GCG), CG → GT (CGT), GC → CA (GCA)
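k-mer multiplicity can be tallied directly from a sequence. A minimal sketch using the example genome above (the function name `kmer_multiplicity` is an assumption of this example):

```python
from collections import Counter

def kmer_multiplicity(seq, k):
    """Count how many times each k-mer occurs in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

counts = kmer_multiplicity("ATGCGTGCGTGCA", 3)
for kmer, n in counts.items():
    print(kmer, n)
```

In real data the multiplicities are not known in advance, which is what makes repeats hard to resolve.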
References
How to apply de Bruijn graphs to genome assembly. Phillip E C Compeau, Pavel A Pevzner & Glenn Tesler. Nature Biotechnology 29, 987–991 (2011). doi:10.1038/nbt.2023. Published online 08 November 2011.
Sequence Assembly. Lecture by Mark Craven (craven@biostat.wisc.edu). BMI/CS 576 (www.biostat.wisc.edu/bmi576/), Fall 2011.
	

Genome assembly: An Introduction (2016)

  • 2. •  De novo genome assembly •  The fragment assembly problem •  Shortest superstring problem •  Seven bridges of Königsberg •  Assembly as a graph theoretical problem •  We construct a de Bruijn graph •  Underlying assumptions of genome assemblies 2Sebastian Schmeier
  • 3. •  The process of generating a new genome sequence from NGS genome sequence reads based on assembly algorithms •  Assembly involves joining short sequence fragments together into long pieces – contigs 3 reads genome sequencing Sebastian Schmeier
  • 10. •  Given:A set of reads (strings) {s1, s2, … , sn } •  Do: Determine a large string s that “best explains” the reads •  What do we mean by “best explains”? •  What assumptions might we require? 10Sebastian Schmeier
  • 11. •  Objective: Find a string s such that •  all reads s1, s2, … , sn are substrings of s •  s is as short as possible •  Assumptions: §  Reads are 100% accurate §  Identical reads must come from the same location on the genome §  “best” = “simplest” 11Sebastian Schmeier
  • 12. 12 ATGCG GCGTG GTGGC TGGCA 1 2 3 ATGCG GCGTG GTGGC TGGCA ATGCGTGGCA •  The assumption is that all substrings are represented •  Even modern sequencers that generate 100nt reads do not cover all possible 100-mers Sebastian Schmeier
  • 13. 13 ATGCG GCGTG GTGGC TGGCA •  Thus, people generally use k-mers of certain length ß  Here we use 3-mers by cutting the original reads into reads of length 3 ATG, TGC, GCG, GCG, CGT, GTG GTG,TGG GGC TGG, GGC, GCA 3-mers Sebastian Schmeier
  • 14. 14 ATGCG GCGTG GTGGC TGGCA •  Thus, people generally use k-mers of certain length ß  Here we use 3-mers by cutting the original reads into reads of length 3 ATG, TGC, GCG, GCG, CGT, GTG GTG,TGG GGC TGG, GGC, GCA 3-mers make them unique Sebastian Schmeier
  • 15. 15 ATGCG GCGTG GTGGC TGGCA •  Thus, people generally use k-mers of certain length ß  Here we use 3-mers by cutting the original reads into reads of length 3 ATG, TGC, GCG, GCG, CGT, GTG GTG,TGG GGC TGG, GGC, GCA 3-mers ATG TGC GCG CGT GTG TGG GGC GCA Draw edge from x to y where suffix from x overlaps prefix from y Sebastian Schmeier
  • 16. 16 ATGCG GCGTG GTGGC TGGCA •  Thus, people generally use k-mers of certain length ß  Here we use 3-mers by cutting the original reads into reads of length 3 ATG, TGC, GCG, GCG, CGT, GTG GTG,TGG GGC TGG, GGC, GCA 3-mers ATG TGC GCG CGT GTG TGG GGC GCA Draw edge from x to y where suffix from x overlaps prefix from y Sebastian Schmeier
  • 17. 17 ATGCG GCGTG GTGGC TGGCA •  Thus, people generally use k-mers of certain length ß  Here we use 3-mers by cutting the original reads into reads of length 3 ATG, TGC, GCG, GCG, CGT, GTG GTG,TGG GGC TGG, GGC, GCA 3-mers ATG TGC GCG CGT GTG TGG GGC GCA Draw edge from x to y where suffix from x overlaps prefix from y Sebastian Schmeier
  • 18. 18 ATGCG GCGTG GTGGC TGGCA •  Thus, people generally use k-mers of certain length ß  Here we use 3-mers by cutting the original reads into reads of length 3 ATG, TGC, GCG, GCG, CGT, GTG GTG,TGG GGC TGG, GGC, GCA 3-mers ATG TGC GCG CGT GTG TGG GGC GCA Draw edge from x to y where suffix from x overlaps prefix from y Sebastian Schmeier
  • 19. 19 ATGCG GCGTG GTGGC TGGCA •  Thus, people generally use k-mers of certain length ß  Here we use 3-mers by cutting the original reads into reads of length 3 ATG, TGC, GCG, GCG, CGT, GTG GTG,TGG GGC TGG, GGC, GCA 3-mers ATG TGC GCG CGT GTG TGG GGC GCA Find Hamiltonian path, that is, a path that visits every vertex exactly once Record the First le0er of each vertex + All le0ers of last vertex Sebastian Schmeier
  • 20. 20 ATGCG GCGTG GTGGC TGGCA •  Thus, people generally use k-mers of certain length ß  Here we use 3-mers by cutting the original reads into reads of length 3 ß  UNFORTUNATELY:The Hamiltonian path problem is very difficult to solve (np-complete) ATG, TGC, GCG, GCG, CGT, GTG GTG,TGG GGC TGG, GGC, GCA 3-mers ATG TGC GCG CGT GTG TGG GGC GCA Find Hamiltonian path, that is, a path that visits every vertex exactly once Record the First le0er of each vertex + All le0ers of last vertex ATGCGTGGCA Sebastian Schmeier
  • 21. 21 ATGCG GCGTG GTGGC TGGCA •  Thus, people generally use k-mers of certain length ß  Here we use 3-mers by cutting the original reads into reads of length 3 ß  UNFORTUNATELY:The Hamiltonian path problem is very difficult to solve (np-complete) ATG, TGC, GCG, GCG, CGT, GTG GTG,TGG GGC TGG, GGC, GCA 3-mers ATG TGC GCG CGT GTG TGG GGC GCA Find Hamiltonian path, that is, a path that visits every vertex exactly once Record the First le0er of each vertex + All le0ers of last vertex ATGCGTGGCA ATGGCGTGCA Sebastian Schmeier
  • 22. In 1735 Leonhard Euler was presented with the following problem (the Seven Bridges of Königsberg): find a walk through the city that crosses each bridge once and only once. He proved that a connected graph with undirected edges contains an Eulerian cycle exactly when every node in the graph has an even number of edges touching it. For the Königsberg bridge graph this is not the case: each of the four nodes has an odd number of edges touching it, so the desired stroll through the city does not exist. (https://en.wikipedia.org/wiki/Leonhard_Euler)
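Euler's even-degree condition is easy to check by counting. The sketch below encodes the seven bridges as an undirected multigraph; the land-mass labels A to D are arbitrary names chosen for this illustration (A stands for the central island), and the check assumes the graph is connected, which holds here.

```python
from collections import Counter

# Seven bridges between four land masses (labels are illustrative).
bridges = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("B", "D"), ("C", "D"),
]

# Each bridge adds one to the degree of both land masses it touches.
degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

# Even degree everywhere is required for an Eulerian cycle (connected graph assumed).
has_eulerian_cycle = all(d % 2 == 0 for d in degree.values())

print(dict(degree))        # {'A': 5, 'B': 3, 'C': 3, 'D': 3}
print(has_eulerian_cycle)  # False
```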
  • 23. Example directed graph with vertices A, B, C, D and directed edges (figure). The degree of a vertex is the number of edges connected to it; the outdegree is the number of outgoing edges; the indegree is the number of incoming edges. Exercise: what are degree(B), outdegree(B), and indegree(D)?
  • 24. The case of directed graphs is similar: a graph in which the indegree equals the outdegree for every node is called 'balanced'. Euler's theorem states that a connected directed graph has an Eulerian cycle if and only if it is balanced. Computationally, finding an Eulerian path is much easier than finding a Hamiltonian path, so we need to reformulate our assembly problem.
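The balance condition can be tested by tallying indegrees and outdegrees. The following is a minimal sketch; the edge lists used to try it out are made up for illustration and are not the graph drawn on the slides, and connectivity is assumed rather than checked.

```python
from collections import Counter

def degrees(edges):
    """Return (indegree, outdegree) counters for a list of directed edges."""
    indeg, outdeg = Counter(), Counter()
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return indeg, outdeg

def is_balanced(edges):
    """True if indegree equals outdegree at every vertex."""
    indeg, outdeg = degrees(edges)
    return all(indeg[n] == outdeg[n] for n in set(indeg) | set(outdeg))

# Hypothetical example graphs.
cycle_graph = [("A", "B"), ("B", "C"), ("C", "A"), ("B", "D"), ("D", "B")]
chain_graph = [("A", "B"), ("B", "C")]

print(is_balanced(cycle_graph))  # True
print(is_balanced(chain_graph))  # False
```

In the first graph every vertex has as many incoming as outgoing edges, so an Eulerian cycle exists; in the second, A only has an outgoing edge and C only an incoming one.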
  • 25. We construct a de Bruijn graph: edges represent k-mers and vertices correspond to (k-1)-mers. (1) Form a node for every distinct prefix or suffix of a k-mer. (2) Connect vertex x to vertex y with a directed edge if some k-mer (e.g., ATG) has prefix x (e.g., AT) and suffix y (e.g., TG), and label the edge with this k-mer. k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT. Distinct (k-1)-mers: AT, TG, GG, GC, CA, CG, GT. Edges, added one at a time across the build slides: ATG (AT -> TG), TGG (TG -> GG), GGC (GG -> GC), TGC (TG -> GC), GTG (GT -> TG), GCG (GC -> CG), CGT (CG -> GT), GCA (GC -> CA).
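The two construction steps translate almost directly into code. This is a minimal sketch with a hypothetical function name `de_bruijn`; it stores the graph as an adjacency list from each (k-1)-mer prefix to the suffixes it points to, so each list entry corresponds to one k-mer edge.

```python
from collections import defaultdict

def de_bruijn(kmers):
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes it points to."""
    graph = defaultdict(list)
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)  # this edge is labelled by kmer itself
    return graph

kmers = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
graph = de_bruijn(kmers)

# Vertices are every prefix or suffix that occurs.
nodes = sorted(set(graph) | {v for targets in graph.values() for v in targets})

print(nodes)  # ['AT', 'CA', 'CG', 'GC', 'GG', 'GT', 'TG']
for prefix, targets in graph.items():
    for suffix in targets:
        print(prefix, "->", suffix, "labelled", prefix + suffix[-1])
```

The edge label is recovered as the prefix plus the last letter of the suffix, which gives back the original k-mer.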
  • 33. Can we find a DNA sequence containing all k-mers? Equivalently: in a de Bruijn graph, can we find a path that visits every edge of the graph exactly once? Such a path is an Eulerian path. A vertex v is semibalanced if |indegree(v) - outdegree(v)| = 1. A connected directed graph has an Eulerian path if and only if it contains at most two semibalanced vertices and all other vertices are balanced. k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT. Distinct (k-1)-mers: AT, TG, GG, GC, CA, CG, GT. Edges: ATG, TGG, GGC, TGC, GTG, GCG, CGT, GCA.
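The degree condition for an Eulerian path can be checked directly from the k-mers, without building the graph explicitly. The sketch below uses hypothetical function names and assumes the graph is connected; it only tests the degree condition.

```python
from collections import Counter

def semibalanced_vertices(kmers):
    """Return (starts, ends, invalid) vertices based on outdegree - indegree."""
    outdeg = Counter(kmer[:-1] for kmer in kmers)  # each prefix gains an outgoing edge
    indeg = Counter(kmer[1:] for kmer in kmers)    # each suffix gains an incoming edge
    starts, ends, invalid = [], [], []
    for v in sorted(set(outdeg) | set(indeg)):
        diff = outdeg[v] - indeg[v]
        if diff == 1:
            starts.append(v)    # one extra outgoing edge: path must start here
        elif diff == -1:
            ends.append(v)      # one extra incoming edge: path must end here
        elif diff != 0:
            invalid.append(v)   # imbalance larger than 1 rules out an Eulerian path
    return starts, ends, invalid

def degree_condition_holds(kmers):
    """Degree test for an Eulerian path (connectivity is assumed, not checked)."""
    starts, ends, invalid = semibalanced_vertices(kmers)
    return not invalid and (len(starts), len(ends)) in {(0, 0), (1, 1)}

kmers = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
print(semibalanced_vertices(kmers))   # (['AT'], ['CA'], [])
print(degree_condition_holds(kmers))  # True
```

For the example k-mers, AT and CA are the only semibalanced vertices, so any Eulerian path must run from AT to CA.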
  • 35. Can we find a DNA sequence containing all k-mers? Equivalently: in a de Bruijn graph, can we find a path that visits every edge of the graph exactly once, that is, an Eulerian path? k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT. Distinct (k-1)-mers: AT, TG, GG, GC, CA, CG, GT. This graph has two Eulerian paths, which spell ATGGCGTGCA and ATGCGTGGCA.
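One standard way to find an Eulerian path efficiently is Hierholzer's algorithm, which the slides do not name; it is used here as an assumed technique to illustrate why the Eulerian formulation is tractable. The sketch assumes an Eulerian path exists and does not validate that. It returns a single path, and which of the two valid genomes it spells depends on the order in which edges are stored.

```python
from collections import defaultdict

def eulerian_path(kmers):
    """Hierholzer-style walk over the de Bruijn graph; returns a list of (k-1)-mers."""
    graph = defaultdict(list)
    indeg = defaultdict(int)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
        indeg[kmer[1:]] += 1
    # Start at the vertex with one extra outgoing edge, if there is one.
    start = next(
        (n for n in list(graph) if len(graph[n]) - indeg[n] == 1),
        kmers[0][:-1],
    )
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # follow and consume an unused edge
        else:
            path.append(stack.pop())         # dead end: vertex is final in the path
    return path[::-1]

def spell(path):
    """First vertex in full, then the last letter of each following vertex."""
    return path[0] + "".join(v[-1] for v in path[1:])

kmers = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
genome = spell(eulerian_path(kmers))
print(genome)
```

Each edge is pushed and popped once, so the running time grows linearly with the number of k-mers, in contrast to trying all vertex orderings.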
  • 37. Four hidden assumptions that do not hold for next-generation sequencing. We took for granted that: (1) we can generate all k-mers present in the genome; (2) all k-mers are error-free; (3) each k-mer appears at most once in the genome; (4) the genome consists of a single chromosome.
  • 38. Four hidden assumptions that do not hold for next-generation sequencing. We took for granted that: (1) we can generate all k-mers present in the genome; (2) all k-mers are error-free. Because neither holds in practice, we do NOT choose the longest possible k-mer: the smaller k is, the higher the chance that we see all k-mers. Errors: with k = 3, the read ATGGCGTGCA is cut into ATG, TGG, GGC, GCG, CGT, GTG, TGC, GCA, and a sequencing error leaves most k-mers unaffected; with k = 10, the whole read ATGGCGTGCA is a single k-mer, so one error affects 100% of the k-mers.
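The effect of k on error sensitivity can be made concrete by comparing the k-mers of a correct read with those of a read carrying one wrong base. In this sketch the error position and the substituted base are hypothetical, chosen only for illustration, and the function names are not from the slides.

```python
def kmers(seq, k):
    """Return all k-mers of a sequence, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def fraction_affected(truth, read, k):
    """Fraction of k-mers in read that differ from the k-mer at the same position in truth."""
    pairs = list(zip(kmers(truth, k), kmers(read, k)))
    return sum(a != b for a, b in pairs) / len(pairs)

truth = "ATGGCGTGCA"
read = "ATGGAGTGCA"  # hypothetical single-base error: C -> A at index 4

print(fraction_affected(truth, read, 3))   # 0.375 (3 of 8 k-mers)
print(fraction_affected(truth, read, 10))  # 1.0 (the only k-mer)
```

A single wrong base corrupts at most k overlapping k-mers, so short k-mers confine the damage while a k-mer as long as the read is lost entirely.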
  • 39. Assumption 3: each k-mer appears at most once in the genome. Because of repeats, this is most often not true; this is known as k-mer multiplicity. Sequence: ATGCGTGCGTGCA. k-mers: ATG, GCA, TGC, TGC, TGC, GTG, GTG, GCG, GCG, CGT, CGT. Distinct (k-1)-mers: AT, TG, GC, CA, CG, GT. Distinct edges: ATG, TGC, GTG, GCG, CGT, GCA.
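k-mer multiplicity can be measured by counting how often each k-mer occurs in the sequence shown on this slide. This is a minimal sketch using the standard-library `Counter`; the variable names are illustrative.

```python
from collections import Counter

genome = "ATGCGTGCGTGCA"
k = 3

# Count every 3-mer, including repeated occurrences.
counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))

# k-mers seen more than once violate the "at most once" assumption.
repeated = {kmer: n for kmer, n in counts.items() if n > 1}

print(dict(counts))
print(repeated)
```

Counted this way, the sequence contains eleven 3-mers in total but only six distinct ones, so the de Bruijn graph needs multiple edges between some pairs of vertices to represent it.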
  • 40. References. How to apply de Bruijn graphs to genome assembly. Phillip E C Compeau, Pavel A Pevzner & Glenn Tesler. Nature Biotechnology 29, 987–991 (2011). doi:10.1038/nbt.2023. Published online 08 November 2011. Sequence Assembly. Lecture by Mark Craven (craven@biostat.wisc.edu). BMI/CS 576 (www.biostat.wisc.edu/bmi576/), Fall 2011. Sebastian Schmeier, s.schmeier@gmail.com, http://sschmeier.com/bioinf-workshop/