SlideShare a Scribd company logo
Title
Sebastian Schmeier
s.schmeier@gmail.com
http://sschmeier.com/bioinf-workshop/
2016
•  De novo genome assembly
•  The fragment assembly problem
•  Shortest superstring problem
•  Seven bridges of Königsberg
•  Assembly as a graph theoretical problem
•  We construct a de Bruijn graph
•  Underlying assumptions of genome assemblies
2Sebastian Schmeier
•  The process of generating a new genome sequence from NGS genome
sequence reads based on assembly algorithms
•  Assembly involves joining short sequence fragments together into long
pieces – contigs
3
reads
genome
sequencing
Sebastian Schmeier
4
ATGCG					
GCGTG	
GTGGC	
TGGCA	
Sebastian Schmeier
5
ATGCG					
GCGTG	
GTGGC	
TGGCA	
Vertex
Sebastian Schmeier
6
ATGCG					
GCGTG	
GTGGC	
TGGCA	
Vertex
Edge
Sebastian Schmeier
7
ATGCG					
GCGTG	
GTGGC	
TGGCA	
Vertex
Edge
Sebastian Schmeier
8
ATGCG					
GCGTG	
GTGGC	
TGGCA	
Vertex
Edge
Sebastian Schmeier
9
ATGCG					
GCGTG	
GTGGC	
TGGCA	
Vertex
ATGCG					
GCGTG	
GTGGC	
TGGCA	
ATGCGTGGCA					
genome	
Edge
Sebastian Schmeier
•  Given:A set of reads (strings) {s1, s2, … , sn }
•  Do: Determine a large string s that “best explains” the reads
•  What do we mean by “best explains”?
•  What assumptions might we require?
10Sebastian Schmeier
•  Objective: Find a string s such that
•  all reads s1, s2, … , sn are substrings of s
•  s is as short as possible
•  Assumptions:
§  Reads are 100% accurate
§  Identical reads must come from the same location on the genome
§  “best” = “simplest”
11Sebastian Schmeier
12
ATGCG					
GCGTG	
GTGGC	
TGGCA	
1	
2	
3	
ATGCG					
GCGTG	
GTGGC	
TGGCA	
ATGCGTGGCA					
•  The assumption is that all substrings are represented
•  Even modern sequencers that generate 100nt reads do not cover all possible 100-mers
Sebastian Schmeier
13
ATGCG					
GCGTG	
GTGGC	
TGGCA	
•  Thus, people generally use k-mers of certain length
ß  Here we use 3-mers by cutting the original reads into reads of length 3
ATG, TGC, GCG,
GCG, CGT, GTG
GTG,TGG GGC
TGG, GGC, GCA
3-mers	
Sebastian Schmeier
14
ATGCG					
GCGTG	
GTGGC	
TGGCA	
•  Thus, people generally use k-mers of certain length
ß  Here we use 3-mers by cutting the original reads into reads of length 3
ATG, TGC, GCG,
GCG, CGT, GTG
GTG,TGG GGC
TGG, GGC, GCA
3-mers	
make	them	unique	
Sebastian Schmeier
15
ATGCG					
GCGTG	
GTGGC	
TGGCA	
•  Thus, people generally use k-mers of certain length
ß  Here we use 3-mers by cutting the original reads into reads of length 3
ATG, TGC, GCG,
GCG, CGT, GTG
GTG,TGG GGC
TGG, GGC, GCA
3-mers	
ATG	
TGC	
GCG	
CGT	
GTG	
TGG	 GGC	
GCA	 Draw	edge	from	x	to	y		
where	
suffix	from	x	overlaps	prefix	from	y	
Sebastian Schmeier
16
ATGCG					
GCGTG	
GTGGC	
TGGCA	
•  Thus, people generally use k-mers of certain length
ß  Here we use 3-mers by cutting the original reads into reads of length 3
ATG, TGC, GCG,
GCG, CGT, GTG
GTG,TGG GGC
TGG, GGC, GCA
3-mers	
ATG	
TGC	
GCG	
CGT	
GTG	
TGG	 GGC	
GCA	 Draw	edge	from	x	to	y		
where	
suffix	from	x	overlaps	prefix	from	y	
Sebastian Schmeier
17
ATGCG					
GCGTG	
GTGGC	
TGGCA	
•  Thus, people generally use k-mers of certain length
ß  Here we use 3-mers by cutting the original reads into reads of length 3
ATG, TGC, GCG,
GCG, CGT, GTG
GTG,TGG GGC
TGG, GGC, GCA
3-mers	
ATG	
TGC	
GCG	
CGT	
GTG	
TGG	 GGC	
GCA	 Draw	edge	from	x	to	y		
where	
suffix	from	x	overlaps	prefix	from	y	
Sebastian Schmeier
18
ATGCG					
GCGTG	
GTGGC	
TGGCA	
•  Thus, people generally use k-mers of certain length
ß  Here we use 3-mers by cutting the original reads into reads of length 3
ATG, TGC, GCG,
GCG, CGT, GTG
GTG,TGG GGC
TGG, GGC, GCA
3-mers	
ATG	
TGC	
GCG	
CGT	
GTG	
TGG	 GGC	
GCA	 Draw	edge	from	x	to	y		
where	
suffix	from	x	overlaps	prefix	from	y	
Sebastian Schmeier
19
ATGCG					
GCGTG	
GTGGC	
TGGCA	
•  Thus, people generally use k-mers of certain length
ß  Here we use 3-mers by cutting the original reads into reads of length 3
ATG, TGC, GCG,
GCG, CGT, GTG
GTG,TGG GGC
TGG, GGC, GCA
3-mers	
ATG	
TGC	
GCG	
CGT	
GTG	
TGG	 GGC	
GCA	
Find	Hamiltonian	path,	that	is,	a		
path	that	visits	every	vertex	exactly	once	
	
Record	the	First	le0er	of	each	vertex	+		
All	le0ers	of	last	vertex	
Sebastian Schmeier
20
ATGCG					
GCGTG	
GTGGC	
TGGCA	
•  Thus, people generally use k-mers of certain length
ß  Here we use 3-mers by cutting the original reads into reads of length 3
ß  UNFORTUNATELY:The Hamiltonian path problem is very difficult to solve (np-complete)
ATG, TGC, GCG,
GCG, CGT, GTG
GTG,TGG GGC
TGG, GGC, GCA
3-mers	
ATG	
TGC	
GCG	
CGT	
GTG	
TGG	 GGC	
GCA	
Find	Hamiltonian	path,	that	is,	a		
path	that	visits	every	vertex	exactly	once	
	
Record	the	First	le0er	of	each	vertex	+		
All	le0ers	of	last	vertex	
ATGCGTGGCA					
Sebastian Schmeier
21
ATGCG					
GCGTG	
GTGGC	
TGGCA	
•  Thus, people generally use k-mers of certain length
ß  Here we use 3-mers by cutting the original reads into reads of length 3
ß  UNFORTUNATELY:The Hamiltonian path problem is very difficult to solve (np-complete)
ATG, TGC, GCG,
GCG, CGT, GTG
GTG,TGG GGC
TGG, GGC, GCA
3-mers	
ATG	
TGC	
GCG	
CGT	
GTG	
TGG	 GGC	
GCA	
Find	Hamiltonian	path,	that	is,	a		
path	that	visits	every	vertex	exactly	once	
	
Record	the	First	le0er	of	each	vertex	+		
All	le0ers	of	last	vertex	
ATGCGTGGCA					
ATGGCGTGCA	
Sebastian Schmeier
•  In 1735 Leonhard Euler was presented with the following problem:
§  find a walk through the city that would cross each bridge once and only once
§  He proved that a connected graph with undirected edges contains an Eulerian cycle
exactly when every node in the graph has an even number of edges touching it.
§  For the Königsberg Bridge graph, this is not the case because each of the four nodes
has an odd number of edges touching it and so the desired stroll through the city does
not exist.
22
https://en.wikipedia.org/wiki/Leonhard_Euler
Sebastian Schmeier
23
A	 B	
D	
Vertex
Directed edge
•  The degree of a vertex: # of edges connected to it
•  outdegree: # of outgoing edges
•  indegree: # of ingoing edges
•  degree(B)?
•  outdegree(B)?
•  indegree(D)?
C	
Sebastian Schmeier
•  The case of directed graphs is similar:
§  A graph in which indegrees are equal to outdegrees for all nodes is
called 'balanced'.
§  Euler's theorem states that a connected directed graph has an Eulerian
cycle if and only if it is balanced.
•  Mathematically/computationally finding Eulerian path is much easier than
Hamiltonian
à we need to reformulate our assembly problem
24Sebastian Schmeier
We construct a de Bruijn graph:
§  edges represent k-mers
§  vertices correspond to (k-1)-mers
1.  Form a node for every distinct prefix or suffix of a k-mer
2.  Connect vertex x to vertex y with a directed edge if some k-mer (e.g.,ATG) has prefix x
(e.g.,AT) and suffix y (e.g.,TG), and label the edge with this k-mer.
25
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
	
Sebastian Schmeier
We construct a de Bruijn graph:
§  edges represent k-mers
§  vertices correspond to (k-1)-mers
1.  Form a node for every distinct prefix or suffix of a k-mer
2.  Connect vertex x to vertex y with a directed edge if some k-mer (e.g.,ATG) has prefix x
(e.g.,AT) and suffix y (e.g.,TG), and label the edge with this k-mer.
26
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
	
AT	
Sebastian Schmeier
We construct a de Bruijn graph:
§  edges represent k-mers
§  vertices correspond to (k-1)-mers
1.  Form a node for every distinct prefix or suffix of a k-mer
2.  Connect vertex x to vertex y with a directed edge if some k-mer (e.g.,ATG) has prefix x
(e.g.,AT) and suffix y (e.g.,TG), and label the edge with this k-mer.
27
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
	
AT	 TG	
Sebastian Schmeier
We construct a de Bruijn graph:
§  edges represent k-mers
§  vertices correspond to (k-1)-mers
1.  Form a node for every distinct prefix or suffix of a k-mer
2.  Connect vertex x to vertex y with a directed edge if some k-mer (e.g.,ATG) has prefix x
(e.g.,AT) and suffix y (e.g.,TG), and label the edge with this k-mer.
28
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
	
AT	 TG	
GG	
Sebastian Schmeier
We construct a de Bruijn graph:
§  edges represent k-mers
§  vertices correspond to (k-1)-mers
1.  Form a node for every distinct prefix or suffix of a k-mer
2.  Connect vertex x to vertex y with a directed edge if some k-mer (e.g.,ATG) has prefix x
(e.g.,AT) and suffix y (e.g.,TG), and label the edge with this k-mer.
29
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
	
AT	 TG	
GG	
GC	 CA	
CG	GT	
Sebastian Schmeier
We construct a de Bruijn graph:
§  edges represent k-mers
§  vertices correspond to (k-1)-mers
1.  Form a node for every distinct prefix or suffix of a k-mer
2.  Connect vertex x to vertex y with a directed edge if some k-mer (e.g.,ATG) has prefix x
(e.g.,AT) and suffix y (e.g.,TG), and label the edge with this k-mer.
30
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
	
AT	 TG	
GG	
GC	 CA	
CG	GT	
ATG
Sebastian Schmeier
We construct a de Bruijn graph:
§  edges represent k-mers
§  vertices correspond to (k-1)-mers
1.  Form a node for every distinct prefix or suffix of a k-mer
2.  Connect vertex x to vertex y with a directed edge if some k-mer (e.g.,ATG) has prefix x
(e.g.,AT) and suffix y (e.g.,TG), and label the edge with this k-mer.
31
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
	
AT	 TG	
GG	
GC	 CA	
CG	GT	
ATG
TGG
Sebastian Schmeier
We construct a de Bruijn graph:
§  edges represent k-mers
§  vertices correspond to (k-1)-mers
1.  Form a node for every distinct prefix or suffix of a k-mer
2.  Connect vertex x to vertex y with a directed edge if some k-mer (e.g.,ATG) has prefix x
(e.g.,AT) and suffix y (e.g.,TG), and label the edge with this k-mer.
32
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
	
AT	 TG	
GG	
GC	 CA	
CG	GT	
ATG
TGG GGC
TGC
GTG GCG
CGT
GCA
Sebastian Schmeier
Can we find a DNA sequence containing all k-mers?
à  In a de Bruijn graph, can we find a path that visits every edge of the graph exactly once?
à  Eulerian path
•  a vertex v is semibalanced if |indegree(v) – outdegree(v)| = 1
•  a connected graph has an Eulerian path if and only if it contains at most two semibalanced
vertices
33
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
	
AT	 TG	
GG	
GC	 CA	
CG	GT	
ATG
TGG GGC
TGC
GTG GCG
CGT
GCA
Sebastian Schmeier
Can we find a DNA sequence containing all k-mers?
à  In a de Bruijn graph, can we find a path that visits every edge of the graph exactly once?
à  Eulerian path
•  a vertex v is semibalanced if |indegree(v) – outdegree(v)| = 1
•  a connected graph has an Eulerian path if and only if it contains at most two semibalanced
vertices
34
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
	
AT	 TG	
GG	
GC	 CA	
CG	GT	
ATG
TGG GGC
TGC
GTG GCG
CGT
GCA
Sebastian Schmeier
Can we find a DNA sequence containing all k-mers?
à  In a de Bruijn graph, can we find a path that visits every edge of the graph exactly once?
à  Eulerian path
35
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
	
AT	 TG	
GG	
GC	 CA	
CG	GT	
ATG
TGG GGC
TGC
GTG GCG
CGT
GCA
ATGGCGTGCA	
Sebastian Schmeier
Can we find a DNA sequence containing all k-mers?
à  In a de Bruijn graph, can we find a path that visits every edge of the graph exactly once?
à  Eulerian path
36
k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Distinct (k-1)-mers:
	
AT	 TG	
GG	
GC	 CA	
CG	GT	
ATG
TGG GGC
TGC
GTG GCG
CGT
GCA
ATGGCGTGCA	
ATGCGTGGCA	
Sebastian Schmeier
•  Four hidden assumptions that do not hold for next-generation sequencing
We took for granted that:
1.  we can generate all k-mers present in the genome
2.  all k-mers are error free
3.  each k-mer appears at most once in the genome
4.  the genome consists of a single chromosome
37Sebastian Schmeier
•  Four hidden assumptions that do not hold for next-generation sequencing
We took for granted that:
1.  we can generate all k-mers present in the genome
2.  all k-mers are error free
•  Due to these reasons we do NOT choose the longest possible k-mer
•  The smaller the k-mer the higher the possibility that we see all k-mers
•  Errors:
38
ATGGCGTGCA	
ATG			TGG			GGC			GCG			CGT			GTG			TGC			GCA	
Mostly	unaffected	k-mers	
ATGGCGTGCA	
100%	affected	k-mers	
K=3	
K=10	
Sebastian Schmeier
Each k-mer appears at most once in the genome à repeats
•  This is most often not true
•  This is known as k-mer multiplicity
39
k-mers:
ATG, GCA,
TGC,TGC, GTG, GTG, GCG, GCG, CGT, CGT
Distinct (k-1)-mers:
	
AT	 TG	 GC	 CA	
CG	GT	
ATG TGC
GTG GCG
CGT
GCA
ATGCGTGCGTGCA	
Sebastian Schmeier
‹#›	 40
References
How to apply de Bruijn graphs to genome assembly. Phillip E C Compeau, Pavel A Pevzner & GlennTesler. Nature Biotechnology
29, 987–991 (2011) doi:10.1038/nbt.2023 Published online 08 November 2011
Sequence Assembly. Lecture by Mark Craven (craven@biostat.wisc.edu). BMI/CS 576 (www.biostat.wisc.edu/bmi576/), Fall 2011
	
Sebastian Schmeier
s.schmeier@gmail.com
http://sschmeier.com/bioinf-workshop/

More Related Content

What's hot

Network components and biological network construction methods
Network components and biological network construction methodsNetwork components and biological network construction methods
Network components and biological network construction methods
Bioinformatics and Computational Biosciences Branch
 
Genome assembly
Genome assemblyGenome assembly
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
Jajati Keshari Nayak
 
Rna seq pipeline
Rna seq pipelineRna seq pipeline
Rna seq pipeline
Karan Veer Singh
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analyses
rjorton
 
Whole exome sequencing(wes)
Whole exome sequencing(wes)Whole exome sequencing(wes)
Whole exome sequencing(wes)
Ibrahim Vazirabad
 
An introduction to RNA-seq data analysis
An introduction to RNA-seq data analysisAn introduction to RNA-seq data analysis
An introduction to RNA-seq data analysis
AGRF_Ltd
 
Microarray Data Analysis
Microarray Data AnalysisMicroarray Data Analysis
Microarray Data Analysis
yuvraj404
 
NGS: Mapping and de novo assembly
NGS: Mapping and de novo assemblyNGS: Mapping and de novo assembly
NGS: Mapping and de novo assembly
Bioinformatics and Computational Biosciences Branch
 
2 whole genome sequencing and analysis
2 whole genome sequencing and analysis2 whole genome sequencing and analysis
2 whole genome sequencing and analysis
saberhussain9
 
Forward and reverse genetics
Forward and reverse geneticsForward and reverse genetics
Forward and reverse genetics
Vinod Pawar
 
RNA-seq differential expression analysis
RNA-seq differential expression analysisRNA-seq differential expression analysis
RNA-seq differential expression analysis
mikaelhuss
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
kiran singh
 
Genome Assembly 2018
Genome Assembly 2018Genome Assembly 2018
Genome Assembly 2018
Aureliano Bombarely
 
Assembly and gene_prediction
Assembly and gene_predictionAssembly and gene_prediction
Assembly and gene_prediction
Bas van Breukelen
 
Sequence alignment belgaum
Sequence alignment belgaumSequence alignment belgaum
Sequence alignment belgaum
National Institute of Biologics
 
Genomics-Mapping and sequencing.pdf
Genomics-Mapping and sequencing.pdfGenomics-Mapping and sequencing.pdf
Genomics-Mapping and sequencing.pdf
shinycthomas
 
RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1
BITS
 
RNA-seq Analysis
RNA-seq AnalysisRNA-seq Analysis
RNA-seq Analysis
COST action BM1006
 
2 md2016 annotation
2 md2016 annotation2 md2016 annotation
2 md2016 annotation
Scott Dawson
 

What's hot (20)

Network components and biological network construction methods
Network components and biological network construction methodsNetwork components and biological network construction methods
Network components and biological network construction methods
 
Genome assembly
Genome assemblyGenome assembly
Genome assembly
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
 
Rna seq pipeline
Rna seq pipelineRna seq pipeline
Rna seq pipeline
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analyses
 
Whole exome sequencing(wes)
Whole exome sequencing(wes)Whole exome sequencing(wes)
Whole exome sequencing(wes)
 
An introduction to RNA-seq data analysis
An introduction to RNA-seq data analysisAn introduction to RNA-seq data analysis
An introduction to RNA-seq data analysis
 
Microarray Data Analysis
Microarray Data AnalysisMicroarray Data Analysis
Microarray Data Analysis
 
NGS: Mapping and de novo assembly
NGS: Mapping and de novo assemblyNGS: Mapping and de novo assembly
NGS: Mapping and de novo assembly
 
2 whole genome sequencing and analysis
2 whole genome sequencing and analysis2 whole genome sequencing and analysis
2 whole genome sequencing and analysis
 
Forward and reverse genetics
Forward and reverse geneticsForward and reverse genetics
Forward and reverse genetics
 
RNA-seq differential expression analysis
RNA-seq differential expression analysisRNA-seq differential expression analysis
RNA-seq differential expression analysis
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
 
Genome Assembly 2018
Genome Assembly 2018Genome Assembly 2018
Genome Assembly 2018
 
Assembly and gene_prediction
Assembly and gene_predictionAssembly and gene_prediction
Assembly and gene_prediction
 
Sequence alignment belgaum
Sequence alignment belgaumSequence alignment belgaum
Sequence alignment belgaum
 
Genomics-Mapping and sequencing.pdf
Genomics-Mapping and sequencing.pdfGenomics-Mapping and sequencing.pdf
Genomics-Mapping and sequencing.pdf
 
RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1
 
RNA-seq Analysis
RNA-seq AnalysisRNA-seq Analysis
RNA-seq Analysis
 
2 md2016 annotation
2 md2016 annotation2 md2016 annotation
2 md2016 annotation
 

Viewers also liked

20101209 dnaseq pevzner
20101209 dnaseq pevzner20101209 dnaseq pevzner
20101209 dnaseq pevzner
Computer Science Club
 
How we revealed genomes secrets?
How we revealed genomes secrets? How we revealed genomes secrets?
How we revealed genomes secrets?
ehsan sepahi
 
de Bruijn Graph Construction from Combination of Short and Long Reads
de Bruijn Graph Construction from Combination of Short and Long Readsde Bruijn Graph Construction from Combination of Short and Long Reads
de Bruijn Graph Construction from Combination of Short and Long Reads
Sikder Tahsin Al-Amin
 
Combining de Bruijn graph, overlap graph and microassembly for de novo genome...
Combining de Bruijn graph, overlap graph and microassembly for de novo genome...Combining de Bruijn graph, overlap graph and microassembly for de novo genome...
Combining de Bruijn graph, overlap graph and microassembly for de novo genome...
Anton Alexandrov
 
Colored de Bruijn Graphs
Colored de Bruijn GraphsColored de Bruijn Graphs
Colored de Bruijn Graphs
Marcos Castro
 
De Bruijn Sequences for Fun and Profit
De Bruijn Sequences for Fun and ProfitDe Bruijn Sequences for Fun and Profit
De Bruijn Sequences for Fun and Profit
Aleksandar Bradic
 
ChIP-seq - Data processing
ChIP-seq - Data processingChIP-seq - Data processing
ChIP-seq - Data processing
Sebastian Schmeier
 
Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)
Sebastian Schmeier
 
Primer design task
Primer design taskPrimer design task
Primer design task
Karan Veer Singh
 
Sequencing, Alignment and Assembly
Sequencing, Alignment and AssemblySequencing, Alignment and Assembly
Sequencing, Alignment and Assembly
Shaun Jackman
 
Pcr primer design english version
Pcr primer design english versionPcr primer design english version
Pcr primer design english version
Mahidol University, Thailand
 
PCR Primer desining
PCR Primer desiningPCR Primer desining
PCR Primer desining
Karan Veer Singh
 
Next generation sequencing
Next generation sequencingNext generation sequencing
Next generation sequencing
Dayananda Salam
 
Ngs ppt
Ngs pptNgs ppt
Ngs ppt
Archa Dave
 
Ngs intro_v6_public
 Ngs intro_v6_public Ngs intro_v6_public
Ngs intro_v6_public
François PAILLIER
 
Introduction to next generation sequencing
Introduction to next generation sequencingIntroduction to next generation sequencing
Introduction to next generation sequencing
VHIR Vall d’Hebron Institut de Recerca
 
NGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platformsNGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platforms
Annelies Haegeman
 

Viewers also liked (17)

20101209 dnaseq pevzner
20101209 dnaseq pevzner20101209 dnaseq pevzner
20101209 dnaseq pevzner
 
How we revealed genomes secrets?
How we revealed genomes secrets? How we revealed genomes secrets?
How we revealed genomes secrets?
 
de Bruijn Graph Construction from Combination of Short and Long Reads
de Bruijn Graph Construction from Combination of Short and Long Readsde Bruijn Graph Construction from Combination of Short and Long Reads
de Bruijn Graph Construction from Combination of Short and Long Reads
 
Combining de Bruijn graph, overlap graph and microassembly for de novo genome...
Combining de Bruijn graph, overlap graph and microassembly for de novo genome...Combining de Bruijn graph, overlap graph and microassembly for de novo genome...
Combining de Bruijn graph, overlap graph and microassembly for de novo genome...
 
Colored de Bruijn Graphs
Colored de Bruijn GraphsColored de Bruijn Graphs
Colored de Bruijn Graphs
 
De Bruijn Sequences for Fun and Profit
De Bruijn Sequences for Fun and ProfitDe Bruijn Sequences for Fun and Profit
De Bruijn Sequences for Fun and Profit
 
ChIP-seq - Data processing
ChIP-seq - Data processingChIP-seq - Data processing
ChIP-seq - Data processing
 
Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)Next-generation sequencing and quality control: An Introduction (2016)
Next-generation sequencing and quality control: An Introduction (2016)
 
Primer design task
Primer design taskPrimer design task
Primer design task
 
Sequencing, Alignment and Assembly
Sequencing, Alignment and AssemblySequencing, Alignment and Assembly
Sequencing, Alignment and Assembly
 
Pcr primer design english version
Pcr primer design english versionPcr primer design english version
Pcr primer design english version
 
PCR Primer desining
PCR Primer desiningPCR Primer desining
PCR Primer desining
 
Next generation sequencing
Next generation sequencingNext generation sequencing
Next generation sequencing
 
Ngs ppt
Ngs pptNgs ppt
Ngs ppt
 
Ngs intro_v6_public
 Ngs intro_v6_public Ngs intro_v6_public
Ngs intro_v6_public
 
Introduction to next generation sequencing
Introduction to next generation sequencingIntroduction to next generation sequencing
Introduction to next generation sequencing
 
NGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platformsNGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platforms
 

Similar to Genome assembly: An Introduction (2016)

Ch06 multalign
Ch06 multalignCh06 multalign
Ch06 multalign
BioinformaticsInstitute
 
Biconnectivity
BiconnectivityBiconnectivity
Biconnectivity
msramanujan
 
On Sampling from Massive Graph Streams
On Sampling from Massive Graph StreamsOn Sampling from Massive Graph Streams
On Sampling from Massive Graph Streams
Nesreen K. Ahmed
 
Finding Dense Subgraphs
Finding Dense SubgraphsFinding Dense Subgraphs
Finding Dense Subgraphs
Carlos Castillo (ChaTo)
 
Slender columnn and two-way slabs
Slender columnn and two-way slabsSlender columnn and two-way slabs
Slender columnn and two-way slabs
civilengineeringfreedownload
 
Unit ix graph
Unit   ix    graph Unit   ix    graph
Unit ix graph
Tribhuvan University
 
Chap10 slides
Chap10 slidesChap10 slides
Chap10 slides
HJ DS
 
VIGAS CONJUGADAS - RESISTENCIA DE LOS MATERIALES
VIGAS CONJUGADAS - RESISTENCIA DE LOS MATERIALES VIGAS CONJUGADAS - RESISTENCIA DE LOS MATERIALES
VIGAS CONJUGADAS - RESISTENCIA DE LOS MATERIALES
JHON ANYELO TIBURCIO BAUTISTA
 
Lecture15
Lecture15Lecture15
Unit 9 graph
Unit   9 graphUnit   9 graph
Unit 9 graph
Dabbal Singh Mahara
 
Ivd soda-2019
Ivd soda-2019Ivd soda-2019
Ivd soda-2019
AkankshaAgrawal55
 
대용량의 동적인 그래프 및 텐서 마이닝 (Mining Large Dynamic Graphs and Tensors)
대용량의 동적인 그래프 및 텐서 마이닝 (Mining Large Dynamic Graphs and Tensors)대용량의 동적인 그래프 및 텐서 마이닝 (Mining Large Dynamic Graphs and Tensors)
대용량의 동적인 그래프 및 텐서 마이닝 (Mining Large Dynamic Graphs and Tensors)
NAVER Engineering
 
ch04.pdf
ch04.pdfch04.pdf
ch10.5.pptx
ch10.5.pptxch10.5.pptx
ch10.5.pptx
GauravGautam216125
 
Hprec8 2
Hprec8 2Hprec8 2
Hprec8 2
stevenhbills
 
parameterized complexity for graph Motif
parameterized complexity for graph Motifparameterized complexity for graph Motif
parameterized complexity for graph Motif
AMR koura
 
Column math
Column mathColumn math
Column math
Dr. H.M.A. Mahzuz
 
Curve clipping
Curve clippingCurve clipping
Curve clipping
Mohamed El-Serngawy
 
Incremental and parallel computation of structural graph summaries for evolvi...
Incremental and parallel computation of structural graph summaries for evolvi...Incremental and parallel computation of structural graph summaries for evolvi...
Incremental and parallel computation of structural graph summaries for evolvi...
Till Blume
 
Sampling from Massive Graph Streams: A Unifying Framework
Sampling from Massive Graph Streams: A Unifying FrameworkSampling from Massive Graph Streams: A Unifying Framework
Sampling from Massive Graph Streams: A Unifying Framework
Nesreen K. Ahmed
 

Similar to Genome assembly: An Introduction (2016) (20)

Ch06 multalign
Ch06 multalignCh06 multalign
Ch06 multalign
 
Biconnectivity
BiconnectivityBiconnectivity
Biconnectivity
 
On Sampling from Massive Graph Streams
On Sampling from Massive Graph StreamsOn Sampling from Massive Graph Streams
On Sampling from Massive Graph Streams
 
Finding Dense Subgraphs
Finding Dense SubgraphsFinding Dense Subgraphs
Finding Dense Subgraphs
 
Slender columnn and two-way slabs
Slender columnn and two-way slabsSlender columnn and two-way slabs
Slender columnn and two-way slabs
 
Unit ix graph
Unit   ix    graph Unit   ix    graph
Unit ix graph
 
Chap10 slides
Chap10 slidesChap10 slides
Chap10 slides
 
VIGAS CONJUGADAS - RESISTENCIA DE LOS MATERIALES
VIGAS CONJUGADAS - RESISTENCIA DE LOS MATERIALES VIGAS CONJUGADAS - RESISTENCIA DE LOS MATERIALES
VIGAS CONJUGADAS - RESISTENCIA DE LOS MATERIALES
 
Lecture15
Lecture15Lecture15
Lecture15
 
Unit 9 graph
Unit   9 graphUnit   9 graph
Unit 9 graph
 
Ivd soda-2019
Ivd soda-2019Ivd soda-2019
Ivd soda-2019
 
대용량의 동적인 그래프 및 텐서 마이닝 (Mining Large Dynamic Graphs and Tensors)
대용량의 동적인 그래프 및 텐서 마이닝 (Mining Large Dynamic Graphs and Tensors)대용량의 동적인 그래프 및 텐서 마이닝 (Mining Large Dynamic Graphs and Tensors)
대용량의 동적인 그래프 및 텐서 마이닝 (Mining Large Dynamic Graphs and Tensors)
 
ch04.pdf
ch04.pdfch04.pdf
ch04.pdf
 
ch10.5.pptx
ch10.5.pptxch10.5.pptx
ch10.5.pptx
 
Hprec8 2
Hprec8 2Hprec8 2
Hprec8 2
 
parameterized complexity for graph Motif
parameterized complexity for graph Motifparameterized complexity for graph Motif
parameterized complexity for graph Motif
 
Column math
Column mathColumn math
Column math
 
Curve clipping
Curve clippingCurve clipping
Curve clipping
 
Incremental and parallel computation of structural graph summaries for evolvi...
Incremental and parallel computation of structural graph summaries for evolvi...Incremental and parallel computation of structural graph summaries for evolvi...
Incremental and parallel computation of structural graph summaries for evolvi...
 
Sampling from Massive Graph Streams: A Unifying Framework
Sampling from Massive Graph Streams: A Unifying FrameworkSampling from Massive Graph Streams: A Unifying Framework
Sampling from Massive Graph Streams: A Unifying Framework
 

Recently uploaded

Liberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdfLiberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdf
WaniBasim
 
How to Fix the Import Error in the Odoo 17
How to Fix the Import Error in the Odoo 17How to Fix the Import Error in the Odoo 17
How to Fix the Import Error in the Odoo 17
Celine George
 
South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)
Academy of Science of South Africa
 
The basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptxThe basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptx
heathfieldcps1
 
Digital Artifact 1 - 10VCD Environments Unit
Digital Artifact 1 - 10VCD Environments UnitDigital Artifact 1 - 10VCD Environments Unit
Digital Artifact 1 - 10VCD Environments Unit
chanes7
 
A Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptxA Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptx
thanhdowork
 
Smart-Money for SMC traders good time and ICT
Smart-Money for SMC traders good time and ICTSmart-Money for SMC traders good time and ICT
Smart-Money for SMC traders good time and ICT
simonomuemu
 
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptxC1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
mulvey2
 
How to Manage Your Lost Opportunities in Odoo 17 CRM
How to Manage Your Lost Opportunities in Odoo 17 CRMHow to Manage Your Lost Opportunities in Odoo 17 CRM
How to Manage Your Lost Opportunities in Odoo 17 CRM
Celine George
 
Life upper-Intermediate B2 Workbook for student
Life upper-Intermediate B2 Workbook for studentLife upper-Intermediate B2 Workbook for student
Life upper-Intermediate B2 Workbook for student
NgcHiNguyn25
 
MARY JANE WILSON, A “BOA MÃE” .
MARY JANE WILSON, A “BOA MÃE”           .MARY JANE WILSON, A “BOA MÃE”           .
MARY JANE WILSON, A “BOA MÃE” .
Colégio Santa Teresinha
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
camakaiclarkmusic
 
Assessment and Planning in Educational technology.pptx
Assessment and Planning in Educational technology.pptxAssessment and Planning in Educational technology.pptx
Assessment and Planning in Educational technology.pptx
Kavitha Krishnan
 
Your Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective UpskillingYour Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective Upskilling
Excellence Foundation for South Sudan
 
How to Build a Module in Odoo 17 Using the Scaffold Method
How to Build a Module in Odoo 17 Using the Scaffold MethodHow to Build a Module in Odoo 17 Using the Scaffold Method
How to Build a Module in Odoo 17 Using the Scaffold Method
Celine George
 
Main Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docxMain Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docx
adhitya5119
 
The Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collectionThe Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collection
Israel Genealogy Research Association
 
PIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf IslamabadPIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf Islamabad
AyyanKhan40
 
The History of Stoke Newington Street Names
The History of Stoke Newington Street NamesThe History of Stoke Newington Street Names
The History of Stoke Newington Street Names
History of Stoke Newington
 
Top five deadliest dog breeds in America
Top five deadliest dog breeds in AmericaTop five deadliest dog breeds in America
Top five deadliest dog breeds in America
Bisnar Chase Personal Injury Attorneys
 

Recently uploaded (20)

Liberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdfLiberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdf
 
How to Fix the Import Error in the Odoo 17
How to Fix the Import Error in the Odoo 17How to Fix the Import Error in the Odoo 17
How to Fix the Import Error in the Odoo 17
 
South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)
 
The basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptxThe basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptx
 
Digital Artifact 1 - 10VCD Environments Unit
Digital Artifact 1 - 10VCD Environments UnitDigital Artifact 1 - 10VCD Environments Unit
Digital Artifact 1 - 10VCD Environments Unit
 
A Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptxA Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptx
 
Smart-Money for SMC traders good time and ICT
Smart-Money for SMC traders good time and ICTSmart-Money for SMC traders good time and ICT
Smart-Money for SMC traders good time and ICT
 
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptxC1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
 
How to Manage Your Lost Opportunities in Odoo 17 CRM
How to Manage Your Lost Opportunities in Odoo 17 CRMHow to Manage Your Lost Opportunities in Odoo 17 CRM
How to Manage Your Lost Opportunities in Odoo 17 CRM
 
Life upper-Intermediate B2 Workbook for student
Life upper-Intermediate B2 Workbook for studentLife upper-Intermediate B2 Workbook for student
Life upper-Intermediate B2 Workbook for student
 
MARY JANE WILSON, A “BOA MÃE” .
MARY JANE WILSON, A “BOA MÃE”           .MARY JANE WILSON, A “BOA MÃE”           .
MARY JANE WILSON, A “BOA MÃE” .
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
 
Assessment and Planning in Educational technology.pptx
Assessment and Planning in Educational technology.pptxAssessment and Planning in Educational technology.pptx
Assessment and Planning in Educational technology.pptx
 
Your Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective UpskillingYour Skill Boost Masterclass: Strategies for Effective Upskilling
Your Skill Boost Masterclass: Strategies for Effective Upskilling
 
How to Build a Module in Odoo 17 Using the Scaffold Method
How to Build a Module in Odoo 17 Using the Scaffold MethodHow to Build a Module in Odoo 17 Using the Scaffold Method
How to Build a Module in Odoo 17 Using the Scaffold Method
 
Main Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docxMain Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docx
 
The Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collectionThe Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collection
 
PIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf IslamabadPIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf Islamabad
 
The History of Stoke Newington Street Names
The History of Stoke Newington Street NamesThe History of Stoke Newington Street Names
The History of Stoke Newington Street Names
 
Top five deadliest dog breeds in America
Top five deadliest dog breeds in AmericaTop five deadliest dog breeds in America
Top five deadliest dog breeds in America
 

Genome assembly: An Introduction (2016)

  • 2. •  De novo genome assembly •  The fragment assembly problem •  Shortest superstring problem •  Seven bridges of Königsberg •  Assembly as a graph theoretical problem •  We construct a de Bruijn graph •  Underlying assumptions of genome assemblies 2Sebastian Schmeier
  • 3. •  The process of generating a new genome sequence from NGS genome sequence reads based on assembly algorithms •  Assembly involves joining short sequence fragments together into long pieces – contigs 3 reads genome sequencing Sebastian Schmeier
  • 10. •  Given:A set of reads (strings) {s1, s2, … , sn } •  Do: Determine a large string s that “best explains” the reads •  What do we mean by “best explains”? •  What assumptions might we require? 10Sebastian Schmeier
  • 11. •  Objective: Find a string s such that •  all reads s1, s2, … , sn are substrings of s •  s is as short as possible •  Assumptions: §  Reads are 100% accurate §  Identical reads must come from the same location on the genome §  “best” = “simplest” 11Sebastian Schmeier
  • 12. 12 ATGCG GCGTG GTGGC TGGCA 1 2 3 ATGCG GCGTG GTGGC TGGCA ATGCGTGGCA •  The assumption is that all substrings are represented •  Even modern sequencers that generate 100nt reads do not cover all possible 100-mers Sebastian Schmeier
  • 13. 13 ATGCG GCGTG GTGGC TGGCA •  Thus, people generally use k-mers of certain length ß  Here we use 3-mers by cutting the original reads into reads of length 3 ATG, TGC, GCG, GCG, CGT, GTG GTG,TGG GGC TGG, GGC, GCA 3-mers Sebastian Schmeier
  • 14. 14 ATGCG GCGTG GTGGC TGGCA •  Thus, people generally use k-mers of certain length ß  Here we use 3-mers by cutting the original reads into reads of length 3 ATG, TGC, GCG, GCG, CGT, GTG GTG,TGG GGC TGG, GGC, GCA 3-mers make them unique Sebastian Schmeier
  • 15. 15 ATGCG GCGTG GTGGC TGGCA •  Thus, people generally use k-mers of certain length ß  Here we use 3-mers by cutting the original reads into reads of length 3 ATG, TGC, GCG, GCG, CGT, GTG GTG,TGG GGC TGG, GGC, GCA 3-mers ATG TGC GCG CGT GTG TGG GGC GCA Draw edge from x to y where suffix from x overlaps prefix from y Sebastian Schmeier
  • 16. 16 ATGCG GCGTG GTGGC TGGCA •  Thus, people generally use k-mers of certain length ß  Here we use 3-mers by cutting the original reads into reads of length 3 ATG, TGC, GCG, GCG, CGT, GTG GTG,TGG GGC TGG, GGC, GCA 3-mers ATG TGC GCG CGT GTG TGG GGC GCA Draw edge from x to y where suffix from x overlaps prefix from y Sebastian Schmeier
  • 17. 17 ATGCG GCGTG GTGGC TGGCA •  Thus, people generally use k-mers of certain length ß  Here we use 3-mers by cutting the original reads into reads of length 3 ATG, TGC, GCG, GCG, CGT, GTG GTG,TGG GGC TGG, GGC, GCA 3-mers ATG TGC GCG CGT GTG TGG GGC GCA Draw edge from x to y where suffix from x overlaps prefix from y Sebastian Schmeier
  • 18. 18 ATGCG GCGTG GTGGC TGGCA •  Thus, people generally use k-mers of certain length ß  Here we use 3-mers by cutting the original reads into reads of length 3 ATG, TGC, GCG, GCG, CGT, GTG GTG,TGG GGC TGG, GGC, GCA 3-mers ATG TGC GCG CGT GTG TGG GGC GCA Draw edge from x to y where suffix from x overlaps prefix from y Sebastian Schmeier
  • 19. 19 ATGCG GCGTG GTGGC TGGCA •  Thus, people generally use k-mers of certain length ß  Here we use 3-mers by cutting the original reads into reads of length 3 ATG, TGC, GCG, GCG, CGT, GTG GTG,TGG GGC TGG, GGC, GCA 3-mers ATG TGC GCG CGT GTG TGG GGC GCA Find Hamiltonian path, that is, a path that visits every vertex exactly once Record the First le0er of each vertex + All le0ers of last vertex Sebastian Schmeier
  • 20. 20 ATGCG GCGTG GTGGC TGGCA •  Thus, people generally use k-mers of certain length ß  Here we use 3-mers by cutting the original reads into reads of length 3 ß  UNFORTUNATELY:The Hamiltonian path problem is very difficult to solve (np-complete) ATG, TGC, GCG, GCG, CGT, GTG GTG,TGG GGC TGG, GGC, GCA 3-mers ATG TGC GCG CGT GTG TGG GGC GCA Find Hamiltonian path, that is, a path that visits every vertex exactly once Record the First le0er of each vertex + All le0ers of last vertex ATGCGTGGCA Sebastian Schmeier
  • 21. 21 ATGCG GCGTG GTGGC TGGCA •  Thus, people generally use k-mers of certain length ß  Here we use 3-mers by cutting the original reads into reads of length 3 ß  UNFORTUNATELY:The Hamiltonian path problem is very difficult to solve (np-complete) ATG, TGC, GCG, GCG, CGT, GTG GTG,TGG GGC TGG, GGC, GCA 3-mers ATG TGC GCG CGT GTG TGG GGC GCA Find Hamiltonian path, that is, a path that visits every vertex exactly once Record the First le0er of each vertex + All le0ers of last vertex ATGCGTGGCA ATGGCGTGCA Sebastian Schmeier
  • 22. •  In 1735 Leonhard Euler was presented with the following problem: §  find a walk through the city that would cross each bridge once and only once §  He proved that a connected graph with undirected edges contains an Eulerian cycle exactly when every node in the graph has an even number of edges touching it. §  For the Königsberg Bridge graph, this is not the case because each of the four nodes has an odd number of edges touching it and so the desired stroll through the city does not exist. 22 https://en.wikipedia.org/wiki/Leonhard_Euler Sebastian Schmeier
  • 23. 23 A B D Vertex Directed edge •  The degree of a vertex: # of edges connected to it •  outdegree: # of outgoing edges •  indegree: # of ingoing edges •  degree(B)? •  outdegree(B)? •  indegree(D)? C Sebastian Schmeier
  • 24. •  The case of directed graphs is similar: §  A graph in which indegrees are equal to outdegrees for all nodes is called 'balanced'. §  Euler's theorem states that a connected directed graph has an Eulerian cycle if and only if it is balanced. •  Mathematically/computationally finding Eulerian path is much easier than Hamiltonian à we need to reformulate our assembly problem 24Sebastian Schmeier
  • 25. We construct a de Bruijn graph: §  edges represent k-mers §  vertices correspond to (k-1)-mers 1.  Form a node for every distinct prefix or suffix of a k-mer 2.  Connect vertex x to vertex y with a directed edge if some k-mer (e.g.,ATG) has prefix x (e.g.,AT) and suffix y (e.g.,TG), and label the edge with this k-mer. 25 k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT Distinct (k-1)-mers: Sebastian Schmeier
  • 26. We construct a de Bruijn graph: §  edges represent k-mers §  vertices correspond to (k-1)-mers 1.  Form a node for every distinct prefix or suffix of a k-mer 2.  Connect vertex x to vertex y with a directed edge if some k-mer (e.g.,ATG) has prefix x (e.g.,AT) and suffix y (e.g.,TG), and label the edge with this k-mer. 26 k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT Distinct (k-1)-mers: AT Sebastian Schmeier
  • 27. We construct a de Bruijn graph: §  edges represent k-mers §  vertices correspond to (k-1)-mers 1.  Form a node for every distinct prefix or suffix of a k-mer 2.  Connect vertex x to vertex y with a directed edge if some k-mer (e.g.,ATG) has prefix x (e.g.,AT) and suffix y (e.g.,TG), and label the edge with this k-mer. 27 k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT Distinct (k-1)-mers: AT TG Sebastian Schmeier
  • 28. We construct a de Bruijn graph: §  edges represent k-mers §  vertices correspond to (k-1)-mers 1.  Form a node for every distinct prefix or suffix of a k-mer 2.  Connect vertex x to vertex y with a directed edge if some k-mer (e.g.,ATG) has prefix x (e.g.,AT) and suffix y (e.g.,TG), and label the edge with this k-mer. 28 k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT Distinct (k-1)-mers: AT TG GG Sebastian Schmeier
  • 29. We construct a de Bruijn graph: §  edges represent k-mers §  vertices correspond to (k-1)-mers 1.  Form a node for every distinct prefix or suffix of a k-mer 2.  Connect vertex x to vertex y with a directed edge if some k-mer (e.g.,ATG) has prefix x (e.g.,AT) and suffix y (e.g.,TG), and label the edge with this k-mer. 29 k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT Distinct (k-1)-mers: AT TG GG GC CA CG GT Sebastian Schmeier
  • 30. We construct a de Bruijn graph: §  edges represent k-mers §  vertices correspond to (k-1)-mers 1.  Form a node for every distinct prefix or suffix of a k-mer 2.  Connect vertex x to vertex y with a directed edge if some k-mer (e.g.,ATG) has prefix x (e.g.,AT) and suffix y (e.g.,TG), and label the edge with this k-mer. 30 k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT Distinct (k-1)-mers: AT TG GG GC CA CG GT ATG Sebastian Schmeier
  • 31. We construct a de Bruijn graph: §  edges represent k-mers §  vertices correspond to (k-1)-mers 1.  Form a node for every distinct prefix or suffix of a k-mer 2.  Connect vertex x to vertex y with a directed edge if some k-mer (e.g.,ATG) has prefix x (e.g.,AT) and suffix y (e.g.,TG), and label the edge with this k-mer. 31 k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT Distinct (k-1)-mers: AT TG GG GC CA CG GT ATG TGG Sebastian Schmeier
  • 32. We construct a de Bruijn graph: §  edges represent k-mers §  vertices correspond to (k-1)-mers 1.  Form a node for every distinct prefix or suffix of a k-mer 2.  Connect vertex x to vertex y with a directed edge if some k-mer (e.g.,ATG) has prefix x (e.g.,AT) and suffix y (e.g.,TG), and label the edge with this k-mer. 32 k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT Distinct (k-1)-mers: AT TG GG GC CA CG GT ATG TGG GGC TGC GTG GCG CGT GCA Sebastian Schmeier
  • 33. Can we find a DNA sequence containing all k-mers? à  In a de Bruijn graph, can we find a path that visits every edge of the graph exactly once? à  Eulerian path •  a vertex v is semibalanced if |indegree(v) – outdegree(v)| = 1 •  a connected graph has an Eulerian path if and only if it contains at most two semibalanced vertices 33 k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT Distinct (k-1)-mers: AT TG GG GC CA CG GT ATG TGG GGC TGC GTG GCG CGT GCA Sebastian Schmeier
  • 34. Can we find a DNA sequence containing all k-mers? à  In a de Bruijn graph, can we find a path that visits every edge of the graph exactly once? à  Eulerian path •  a vertex v is semibalanced if |indegree(v) – outdegree(v)| = 1 •  a connected graph has an Eulerian path if and only if it contains at most two semibalanced vertices 34 k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT Distinct (k-1)-mers: AT TG GG GC CA CG GT ATG TGG GGC TGC GTG GCG CGT GCA Sebastian Schmeier
  • 35. Can we find a DNA sequence containing all k-mers? à  In a de Bruijn graph, can we find a path that visits every edge of the graph exactly once? à  Eulerian path 35 k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT Distinct (k-1)-mers: AT TG GG GC CA CG GT ATG TGG GGC TGC GTG GCG CGT GCA ATGGCGTGCA Sebastian Schmeier
  • 36. Can we find a DNA sequence containing all k-mers? à  In a de Bruijn graph, can we find a path that visits every edge of the graph exactly once? à  Eulerian path 36 k-mers: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT Distinct (k-1)-mers: AT TG GG GC CA CG GT ATG TGG GGC TGC GTG GCG CGT GCA ATGGCGTGCA ATGCGTGGCA Sebastian Schmeier
  • 37. •  Four hidden assumptions that do not hold for next-generation sequencing We took for granted that: 1.  we can generate all k-mers present in the genome 2.  all k-mers are error free 3.  each k-mer appears at most once in the genome 4.  the genome consists of a single chromosome 37Sebastian Schmeier
  • 38. •  Four hidden assumptions that do not hold for next-generation sequencing We took for granted that: 1.  we can generate all k-mers present in the genome 2.  all k-mers are error free •  Due to these reasons we do NOT choose the longest possible k-mer •  The smaller the k-mer the higher the possibility that we see all k-mers •  Errors: 38 ATGGCGTGCA ATG TGG GGC GCG CGT GTG TGC GCA Mostly unaffected k-mers ATGGCGTGCA 100% affected k-mers K=3 K=10 Sebastian Schmeier
  • 39. Each k-mer appears at most once in the genome à repeats •  This is most often not true •  This is known as k-mer multiplicity 39 k-mers: ATG, GCA, TGC,TGC, GTG, GTG, GCG, GCG, CGT, CGT Distinct (k-1)-mers: AT TG GC CA CG GT ATG TGC GTG GCG CGT GCA ATGCGTGCGTGCA Sebastian Schmeier
  • 40. ‹#› 40 References How to apply de Bruijn graphs to genome assembly. Phillip E C Compeau, Pavel A Pevzner & GlennTesler. Nature Biotechnology 29, 987–991 (2011) doi:10.1038/nbt.2023 Published online 08 November 2011 Sequence Assembly. Lecture by Mark Craven (craven@biostat.wisc.edu). BMI/CS 576 (www.biostat.wisc.edu/bmi576/), Fall 2011 Sebastian Schmeier s.schmeier@gmail.com http://sschmeier.com/bioinf-workshop/