Bas E. Dutilh
Bacteriófagos: Aspectos básicos y moleculares. Aplicaciones Biotecnológicas
Buenos Aires, June 29th 2015
Shotgun sequence assembly
Sequencing specs*
Method      Read length (bp)  Accuracy  Million reads  Time       Cost per M
454         100-700           99%       1              1 day      $10
Illumina    50-300            98%       3,000          1-2 days   $0.10
IonTorrent  100-400           98%       40-80          2 hours    $1
PacBio      1,000-30,000      87%       0.05           2 hours    $1
Sanger      400-1,200         99.9%     n/a            2 hours    $2,400
SOLiD       50                99.9%     1,200          1-2 weeks  $0.13
* these numbers change all the time!
Lengths of reads and genomes
• NGS technologies provide reads of 50 to at most 30,000 bp, but most genomes are much longer
Gago, Science 2009
Nucleotide codes
Code  Description              Bases    Number of bases
A     Adenine                  A        1
C     Cytosine                 C        1
G     Guanine                  G        1
T     Thymine                  T        1
U     Uracil                   U        1
W     Weak                     A T      2
S     Strong                   C G      2
M     aMino                    A C      2
K     Keto                     G T      2
R     puRine                   A G      2
Y     pYrimidine               C T      2
B     not A (B after A)        C G T    3
D     not C (D after C)        A G T    3
H     not G (H after G)        A C T    3
V     not T (V after T/U)      A C G    3
N     aNy base (not a gap)     A C G T  4
-     Gap (no nucleotide)               0
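The code table above can be written down as a lookup. This is a minimal sketch; the names `IUPAC`, `matches` and `degeneracy` are made up for illustration and are not from the slides.

```python
# IUPAC nucleotide codes mapped to the bases each code stands for,
# copied from the table above. The gap character stands for no base.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "W": "AT", "S": "CG", "M": "AC", "K": "GT", "R": "AG", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT", "-": "",
}


def matches(code: str, base: str) -> bool:
    """Return True if a concrete base is one of the bases the code allows."""
    return base.upper() in IUPAC[code.upper()]


def degeneracy(code: str) -> int:
    """Number of bases a code can stand for (last column of the table)."""
    return len(IUPAC[code.upper()])
```

For example, `matches("R", "G")` is true because R (puRine) covers A and G, while `degeneracy("-")` is 0.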
Sequence File Formats
• Different file formats for different uses
• Competing formats developed in parallel
• Some easy to read, some easy to parse
Fasta
• Simplest sequence file format
• Unique identifiers!
• “Fasta wide” format has the whole sequence on one line
– Even easier to parse in a computer script
>identifier1 [optional information]
CCGATCATATGACTAGCATGCATCGATCGATCGACTAGCATTT
AGAGCTACGATCAGCACTACACGCTTTGTATGATTGGCGGCGG
CTATTATATTGGGA
>identifier2 [optional information]
GAGAGCTACGATCAGAGCTACGATCAGCACTACACGCTTTGTA
TGATTGGCCCCCTATATTGGGACACGATCAGCACTACACGCTT
TGTATGATTGGCGGCGGCTATCCGATCAT
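A Fasta file like the one above can be read with a few lines of script. The sketch below is one possible parser, not a reference implementation; the function name `parse_fasta` is an assumption. It joins wrapped sequence lines (which gives the "Fasta wide" form) and rejects duplicate identifiers.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each identifier to its full sequence, joining wrapped lines."""
    chunks: Dict[str, List[str]] = {}
    identifier = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The identifier is the first word after '>'; the rest is optional info.
            identifier = line[1:].split()[0]
            if identifier in chunks:
                raise ValueError(f"duplicate identifier: {identifier}")
            chunks[identifier] = []
        elif identifier is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks[identifier].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


example = """>identifier1 [optional information]
CCGATCATATGACTAGCATGCATCGATCGATCGACTAGCATTT
AGAGCTACGATCAGCACTACACGCTTTGTATGATTGGCGGCGG
CTATTATATTGGGA
""".splitlines()

sequences = parse_fasta(example)
```

Here `sequences["identifier1"]` holds the three wrapped lines joined into one string.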
Fastq
• Based on Fasta format
• Contains information about quality of each nucleotide
• Quality estimated by sequencing machine
@SRR014849.1 EIXKN4201CFU84 length=93
GGAAATGGAGTACGGATCGATTTTGTTTGGAACCGAAAGGGTC
+
hhhhhhhhhhhhhhhh7F@71,'";C?,B;?6B;:EA1EA1E%
• Four lines per sequence:
1. Identifier line starting with @
2. DNA sequence on one line
3. Second identifier line starting with + (identifier optional)
4. String of quality scores on one line
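The four-lines-per-sequence layout makes Fastq easy to read in a script. The sketch below assumes strictly four lines per record (no wrapped sequences). The function name `parse_fastq` and the toy record are made up for illustration.

```python
from typing import Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) for each four-line record."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        try:
            sequence = next(it).rstrip("\n")
            plus = next(it).rstrip("\n")
            quality = next(it).rstrip("\n")
        except StopIteration:
            raise ValueError("truncated Fastq record") from None
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("record does not follow the @ / + layout")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality


# Hypothetical toy record, not taken from a real run.
toy = ["@read1 example", "ACGT", "+", "II5+"]
records = list(parse_fastq(toy))
```

`records` then contains one tuple holding the identifier line (without the `@`), the bases, and the quality string.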
Quality scores
• Phred 10: 10^-1 chance that the base is wrong
– 90% accuracy; 10% error rate
• Phred 20: 10^-2 chance that the base is wrong
– 99% accuracy; 1% error rate
• Phred 30: 10^-3 chance that the base is wrong
– 99.9% accuracy; 0.1% error rate
• Etcetera
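The pattern in these examples is the usual Phred relation: a score Q corresponds to an error probability of 10^(-Q/10). A short sketch with illustrative function names:

```python
import math


def phred_to_error(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)
```

So `phred_to_error(20)` gives a 1% error rate, and an error rate of 0.001 maps back to Phred 30.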
ASCII character codes
• Fastq quality score: Phred score + 33, converted to ASCII text
• Note: the old Illumina format was different (it used an offset of 64 instead of 33)!
@SRR014849.1 EIXKN4201CFU84 length=93
GGAAATGGAGTACGGATCGATTTTGTTTGGAACCGAAAGGGTC
+
hhhhhhhhhhhhhhhh7F@71,'";C?,B;?6B;:EA1EA1E%
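Decoding a quality string is the reverse of the encoding: take the ASCII code of each character and subtract the offset. The sketch below defaults to the offset of 33 given above; the offset of 64 for older Illumina files is the commonly described value and is passed explicitly. The function name and the toy quality strings are made up.

```python
from typing import List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert a Fastq quality string into a list of Phred scores."""
    return [ord(character) - offset for character in quality]


# Toy quality string (not from a real read): 'I' is ASCII 73, '5' is 53,
# '+' is 43 and '!' is 33.
scores = decode_quality("II5+!")

# The same characters mean something else under a different offset.
old_style = decode_quality("h", offset=64)
```

With offset 33 the toy string decodes to Phred 40, 40, 20, 10 and 0.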
Quality profile of reads
March 2011
Quality profile of reads
October 2011
Random genome, random coverage
• Average depth:
– Genome size G
– Base depth B=40x
– Read length L=100 bp
– K-mer size K=25 bp
C = B * (L - K + 1) / L
• Uncovered bases:
u = G * e^(-C)
CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA
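The numbers on this slide can be worked through in a few lines. The sketch assumes the uncovered-bases expression is the Poisson form G * e^(-C). The genome size of 5 Mbp is a made-up value, since the slide leaves G open.

```python
import math


def kmer_depth(base_depth: float, read_length: int, k: int) -> float:
    """K-mer depth C = B * (L - K + 1) / L."""
    return base_depth * (read_length - k + 1) / read_length


def expected_uncovered(genome_size: float, depth: float) -> float:
    """Expected number of uncovered positions, u = G * e^(-C)."""
    return genome_size * math.exp(-depth)


C = kmer_depth(base_depth=40, read_length=100, k=25)  # 40 * 76 / 100 = 30.4
G = 5_000_000                                         # hypothetical genome size
u = expected_uncovered(G, C)
```

With B=40x, L=100 and K=25 the k-mer depth is 30.4, and the expected number of uncovered positions in this idealised random model is far below one.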
What is easier to assemble?
Random sequences, or real genomes?
Sequence assembly
Reads
Scaffold
• Order and orientation of contigs
• Sizes of the gaps between contigs (filled with NNN)
Contigs
• Consensus sequence of assembled reads
• Includes alignment of all reads
[Figure: horizontal coverage, depth and coverage of reads along a sequence]
Assembly of shotgun sequences
• Human genome project (2000)
– 1-2 kb Sanger reads
– < 10x coverage
– Low error rate
• High-throughput (meta-)genomics (now)
– Millions/billions of ~100-400 bp reads
– Mix of genomes with different coverage
– Biases and sequencing errors
• Quality drops towards the end of reads
• Homopolymers may be miscalled in 454 or Ion Torrent
Assembly strategies
• Reference-guided assembly
– Align reads to a (database of) reference genome(s)
– Cannot discover:
• Larger genomic mutations
– Insertions, deletions, rearrangements
• Distantly related species
• Most viruses
• De novo assembly
– Requires sufficient coverage x depth
– Breaks on repeats and low-coverage regions
– Algorithms
• Greedy assembly (only to illustrate)
• Overlap-layout-consensus
• De Bruijn graph
Reference-guided assembly
• Illumina sequencing of community DNA
• Same-species genome available (2.8M nt)
• Sometimes, only a minority of the reads can be
mapped/aligned
Distant reference
• Natural diversity of community
– “Species” share >94% average nucleotide identity
– Consensus = “average” of the species
[Figure: reference and consensus in genome space. Konstantinidis and Tiedje, PNAS 2004]
Iterative mapping and assembly
• The assembly is a better representation of the community than the distant reference
• Can we further approach the consensus genome by re-mapping the reads against this first assembly?
[Figure: reference, first assembly and consensus in genome space. Dutilh et al. Bioinformatics 2009]
Iteration improves assembly
• More mapped reads
• Fewer gaps
Dutilh et al. Bioinformatics 2009
De novo assembly
Reads:    AACAAGT
            CAAGTTA
Assembly: AACAAGTTA
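The two-read example can be reproduced by finding the longest suffix of one read that matches a prefix of the other, and then gluing the reads together. A minimal sketch with illustrative function names, using exact matching only (no sequencing errors):

```python
from typing import Optional


def overlap(a: str, b: str, min_length: int = 1) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def merge(a: str, b: str, min_length: int = 1) -> Optional[str]:
    """Join a and b on their overlap, or return None if they do not overlap."""
    n = overlap(a, b, min_length)
    return a + b[n:] if n else None


assembly = merge("AACAAGT", "CAAGTTA")
```

The two reads share the five bases CAAGT, so `assembly` is AACAAGTTA.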
De novo assembly approaches
• Greedy approach
• Overlap-layout-consensus
• De Bruijn graphs
Greedy assembly
1. Sequences (reads)
2. Pairwise all-vs-all similarities
3. Find best matching pair
4. Collapse/assemble
• Works well for few, long reads (Sanger)
– All-vs-all calculations are expensive
– One clear best match
• Does not work for high throughput NGS datasets
– Many reads -> expensive to calculate
– Low coverage requires graph approach
(reads/contigs)
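The four greedy steps can be sketched as a loop that keeps merging the best-overlapping pair. This is an illustration only: it uses exact suffix-prefix matches, ignores reverse complements and sequencing errors, and recomputes all pairs in every round, which is the expensive all-vs-all step mentioned above. The function names and the minimum overlap of 10 are assumptions. The five reads and the 100 bp sequence are the ones shown on the later graph slides.

```python
from itertools import permutations
from typing import List


def overlap(a: str, b: str, min_length: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads: List[str], min_overlap: int = 10) -> List[str]:
    """Repeatedly merge the pair with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(contigs, 2):  # all-vs-all comparison
            n = overlap(a, b, min_overlap)
            if n > best_len:
                best_len, best_a, best_b = n, a, b
        if best_len == 0:
            break  # nothing left to join
        contigs.remove(best_a)
        contigs.remove(best_b)
        contigs.append(best_a + best_b[best_len:])
    return contigs


reads = [
    "TTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA",
    "CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAAT",
    "CTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCT",
    "GTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTT",
    "CTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAAC",
]
genome = (
    "CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAAT"
    "GCTTTTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA"
)
contigs = greedy_assemble(reads)
```

For these error-free, repeat-free reads the greedy merges should collapse everything into a single contig equal to the 100 bp sequence. With repeats, the same loop can join the wrong pair first.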
Repetitive sequences
• Reads A-D are from a region with two long repeats
• Greedy approach would first join A and D, which have the largest overlap, and place B and C in a separate contig
• Resolving this requires a global view of all the possibilities before joining two reads: a graph
[Figure: reads A, B, C and D in a region with two repeats]
What is easier to assemble?
Random sequences, or real genomes?
Assembly as a “graph” problem
• Overlap-layout-consensus
• De Bruijn Graph
• A graph contains nodes and edges
node edge
Overlap-layout-consensus
1. Identify all overlaps between reads
– Use cutoffs: minimum overlap and percent identity
2. Make graph of overlap connections
– Nodes: reads
– Edges: overlaps
3. Find Hamiltonian path
– Path that contains every node once
– No efficient algorithm known (the problem is NP-complete)
4. Determine consensus at each position
CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAAT
CTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCT
GTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTT
CTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAAC
TTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA
[Figure: reads labeled J, K, L, M and N]
CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA
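The overlap-layout-consensus steps can be sketched on a toy scale. The code below builds the overlap graph, then finds a Hamiltonian path by brute force over all read orders, which is only feasible for a handful of reads and shows why this step does not scale. The consensus step is reduced to gluing error-free reads together. The function names, the minimum overlap of 3 and the third toy read are made up for illustration.

```python
from itertools import permutations
from typing import Dict, List, Optional


def overlap(a: str, b: str, min_length: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def build_overlap_graph(reads: List[str], min_overlap: int) -> Dict[int, Dict[int, int]]:
    """Nodes are read indices; an edge i -> j stores the overlap length."""
    graph: Dict[int, Dict[int, int]] = {i: {} for i in range(len(reads))}
    for i, j in permutations(range(len(reads)), 2):
        n = overlap(reads[i], reads[j], min_overlap)
        if n:
            graph[i][j] = n
    return graph


def hamiltonian_path(graph: Dict[int, Dict[int, int]]) -> Optional[List[int]]:
    """Brute force: try every node order until all consecutive edges exist."""
    for order in permutations(graph):
        if all(b in graph[a] for a, b in zip(order, order[1:])):
            return list(order)
    return None


def layout_to_sequence(reads: List[str], graph, path: List[int]) -> str:
    """Glue the reads along the path (stand-in for a real consensus step)."""
    sequence = reads[path[0]]
    for a, b in zip(path, path[1:]):
        sequence += reads[b][graph[a][b]:]
    return sequence


toy_reads = ["AACAAGT", "CAAGTTA", "GTTACCG"]  # third read is hypothetical
graph = build_overlap_graph(toy_reads, min_overlap=3)
path = hamiltonian_path(graph)
contig = layout_to_sequence(toy_reads, graph, path)
```

The number of orders tried by `hamiltonian_path` grows factorially with the number of reads, which is the practical meaning of "no efficient algorithm".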
De Bruijn graph
1. Find every word of length k (k-mer) in every read
– K-mer should be long enough to be quite unique, but
– … short enough to not break on polymorphisms/errors
TTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA
CTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCT
GTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTT
CTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAAC
CTTGATACTAATGCTTTTTGTAATCTTAT
TTGATACTAATGCTTTTTGTAATCTTATT
TGATACTAATGCTTTTTGTAATCTTATTG
GATACTAATGCTTTTTGTAATCTTATTGG
ATACTAATGCTTTTTGTAATCTTATTGGT
TACTAATGCTTTTTGTAATCTTATTGGTT
ACTAATGCTTTTTGTAATCTTATTGGTTG
CTAATGCTTTTTGTAATCTTATTGGTTGG
TAATGCTTTTTGTAATCTTATTGGTTGGC
AATGCTTTTTGTAATCTTATTGGTTGGCT
ATGCTTTTTGTAATCTTATTGGTTGGCTT
TGCTTTTTGTAATCTTATTGGTTGGCTTA
GCTTTTTGTAATCTTATTGGTTGGCTTAA
CTTTTTGTAATCTTATTGGTTGGCTTAAA
TTTTTGTAATCTTATTGGTTGGCTTAAAC
[Figure: reads labeled J, K, L, M and N]
CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA
CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAAT
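Step 1 amounts to sliding a window of length k along each read. The sketch below reproduces the list of words shown on this slide for the 43 bp read, using k=29, which is the length of the words listed. The function name `kmers` is an assumption.

```python
from typing import List


def kmers(read: str, k: int) -> List[str]:
    """All words of length k in a read, in order of appearance."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "CTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAAC"
words = kmers(read, k=29)
```

A read of length L yields L - k + 1 words, so a longer k gives fewer and more specific words per read.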
De Bruijn graph
2. Make graph of sequential k-mers in sequence
– Nodes: k-mers
– Edges: sequential presence of k-mers in reads
TTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA
CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAAT
CTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCT
GTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTT
CTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAAC
CTTGATACTAATGCTTTTTGTAATCTTAT
TTGATACTAATGCTTTTTGTAATCTTATT
TGATACTAATGCTTTTTGTAATCTTATTG
GATACTAATGCTTTTTGTAATCTTATTGG
ATACTAATGCTTTTTGTAATCTTATTGGT
TACTAATGCTTTTTGTAATCTTATTGGTT
ACTAATGCTTTTTGTAATCTTATTGGTTG
CTAATGCTTTTTGTAATCTTATTGGTTGG
TAATGCTTTTTGTAATCTTATTGGTTGGC
AATGCTTTTTGTAATCTTATTGGTTGGCT
ATGCTTTTTGTAATCTTATTGGTTGGCTT
TGCTTTTTGTAATCTTATTGGTTGGCTTA
GCTTTTTGTAATCTTATTGGTTGGCTTAA
CTTTTTGTAATCTTATTGGTTGGCTTAAA
TTTTTGTAATCTTATTGGTTGGCTTAAAC
[Figure: reads labeled J, K, L, M and N]
CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA
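Step 2 can be sketched with nodes as k-mers and an edge wherever two k-mers follow each other in a read, as described above. Each edge carries a count of how often that succession was seen, which corresponds to depth. The function name and the small k=4 toy example are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def build_de_bruijn(reads: List[str], k: int) -> Dict[str, Dict[str, int]]:
    """Nodes are k-mers; edge counts record how often one k-mer follows another."""
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k):
            current_kmer = read[i:i + k]
            next_kmer = read[i + 1:i + k + 1]
            graph[current_kmer][next_kmer] += 1
    # Convert to plain dictionaries so lookups no longer create new entries.
    return {node: dict(targets) for node, targets in graph.items()}


toy_graph = build_de_bruijn(["AACAAGT", "CAAGTTA"], k=4)
```

The edge from CAAG to AAGT is seen in both toy reads and therefore gets a count of 2. No pairwise read comparison was needed to find that the reads overlap.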
De Bruijn graph
CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA
3. Find Eulerian path
– Path that contains every edge once
– Efficient algorithm available
• In an optimal sequencing run of a repeat-less
genome, there is one path connecting all nodes
• In practice (especially in metagenomes) there are
many possible structures in the graph
• Edge width represents the number of linking
reads (depth)
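One efficient way to find an Eulerian path is Hierholzer's algorithm, sketched below. Note an assumption that differs from the slides: the example uses the common formulation in which each k-mer is an edge between its first and last (k-1) bases, so that using every edge once means using every k-mer once. The function names are made up for illustration.

```python
from collections import defaultdict
from typing import List, Optional, Tuple


def eulerian_path(edges: List[Tuple[str, str]]) -> Optional[List[str]]:
    """Return the nodes along a path that uses every edge once, or None."""
    if not edges:
        return []
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for u, v in edges:
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    starts = [n for n, b in balance.items() if b == 1]
    ends = [n for n, b in balance.items() if b == -1]
    if any(abs(b) > 1 for b in balance.values()) or len(starts) > 1 or len(starts) != len(ends):
        return None  # degree conditions for an Eulerian path are violated
    start = starts[0] if starts else edges[0][0]
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())             # dead end: add to the path
    path.reverse()
    return path if len(path) == len(edges) + 1 else None  # None if disconnected


def spell(path: List[str]) -> str:
    """Spell the sequence for a path of overlapping (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


four_mers = ["AACA", "ACAA", "CAAG", "AAGT", "AGTT", "GTTA"]
toy_edges = [(word[:-1], word[1:]) for word in four_mers]
sequence = spell(eulerian_path(toy_edges))
```

The running time is linear in the number of edges, in contrast with the Hamiltonian path search needed for overlap-layout-consensus.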
Possible structures in De Bruijn graphs
• Cycle: path converges on itself
– Repeated region on the same contig
• Frayed rope: converge then diverge
– Repeated region on different contigs
• Bubble: paths diverge then converge
– Sequencing error in the middle of a read
– Polymorphisms
• Spur: short dead-ends
– Sequencing error at the end of a read
– Zero coverage shortly after end of repeat
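A first step towards spotting these structures is to look at how many edges enter and leave each node. The sketch below only flags candidate nodes: where paths diverge, where they converge, and where they stop. Telling a bubble from a frayed rope, or a spur from the true end of a sequence, needs path lengths and depth as well. The function name and the toy graph are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def classify_nodes(edges: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group nodes by in-degree and out-degree."""
    in_degree, out_degree = defaultdict(int), defaultdict(int)
    for u, v in edges:
        out_degree[u] += 1
        in_degree[v] += 1
    report: Dict[str, List[str]] = {"diverge": [], "converge": [], "dead_end": []}
    for node in sorted(set(in_degree) | set(out_degree)):
        if out_degree[node] > 1:
            report["diverge"].append(node)   # start of a bubble or fork
        if in_degree[node] > 1:
            report["converge"].append(node)  # end of a bubble or start of a repeat
        if out_degree[node] == 0:
            report["dead_end"].append(node)  # spur tip or true end of the path
    return report


# Toy bubble: S splits into A and B, which rejoin at T and continue to E.
bubble = [("S", "A"), ("S", "B"), ("A", "T"), ("B", "T"), ("T", "E")]
report = classify_nodes(bubble)
```

In the toy graph the paths diverge at S and converge at T, which is the signature of a bubble.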
What is easier to assemble?
Random sequences, or real genomes?
Examples of De Bruijn graphs
[Figure: sequencing errors in a random circular sequence, at 1%, 5%, 10% and 15%. Pell PNAS 2012]
[Figure: five E. coli subspecies. Peng Bioinformatics 2011]
Random versus real sequences
• Biological sequences are not random
– Genes, operons, promoters, etcetera
– Biased nucleotide usage (GC content)
– Biased oligonucleotide usage (k-mers)
• Repeated sequences in (meta-)genomes
– Low-complexity regions
– Conserved protein domains
– Duplicated genes, horizontal transfers
– “Selfish” elements (e.g. transposons, prophages)
– Polymorphic repeats (haplotypes, strains)
– …etcetera
Repeats have multiple sinks/sources
16S
Salmonella has 7 rrn operons
Salmonella recombines at rrn operons
Helm and Maloy
Repeated regions
• In overlap-layout-consensus and De Bruijn graphs
[Figure: a repeated region in a graph of reads and in a graph of k-mers. Li BFG 2012]
Genome versus metagenome
Metagenome
• Depending on diversity
– Expect many sequences
– Fragmented sequences
– Varying read depth
• Natural microdiversity
• Sequencing errors or natural diversity?
• Repeats also include closely related strains, conserved genes, etc.
Genome
• Depending on coverage
– Expect single sequence
– Contiguous sequence
– Even read depth
• Clonal sequence
• Identify sequencing errors by low coverage
• Repeats consist of duplicated genes and conserved domains
Chimerization in metagenome assembly
• Both OLC and DBG include “chimera protection”
– Break contigs at ambiguities
– Works if depth/coverage is high enough
[Figure: contig1, contig2, contig3, contig4 and contig5]
• Assess final result with different parameters
– High versus low stringency assembly
• Chimerization is more frequent between closely related strains
Assembly strategies
• Reference-guided assembly
– Align reads to a (database of) reference genome(s)
– Cannot discover:
• Larger genomic mutations
– Insertions, deletions, rearrangements
• Distantly related species
• Most viruses
• De novo assembly
– Requires sufficient read lengths, depth, and coverage
– Breaks on long repeats and low-coverage regions
– Algorithms
• Greedy assembly (only to illustrate)
• Overlap-layout-consensus
• De Bruijn graph
Scaffolding
• Use alignments to related genome sequences to sort and orient de novo contigs
Silva et al. Source Code Biol. Med. 2013
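Sorting and orienting contigs can be sketched once each contig has an alignment to the related genome. The input format below is an assumption for illustration (sequence, reference start, reference end, strand), as are the function names. It is not the method of the cited tool. Gaps are filled with N, and gap sizes are estimated from the reference coordinates.

```python
from typing import List, Tuple

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

# Hypothetical alignment record: (sequence, ref_start, ref_end, strand),
# with half-open reference coordinates and strand "+" or "-".
Alignment = Tuple[str, int, int, str]


def reverse_complement(sequence: str) -> str:
    """Reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def scaffold(contigs: List[Alignment], gap_character: str = "N") -> str:
    """Order contigs by reference position, orient them, and fill gaps."""
    pieces = []
    previous_end = None
    for sequence, start, end, strand in sorted(contigs, key=lambda c: c[1]):
        if strand == "-":
            sequence = reverse_complement(sequence)
        if previous_end is not None:
            # Use at least one gap character, also when alignments touch or overlap.
            pieces.append(gap_character * max(start - previous_end, 1))
        pieces.append(sequence)
        previous_end = end
    return "".join(pieces)


toy_contigs = [("GGTT", 10, 14, "+"), ("TTGA", 0, 4, "-")]
result = scaffold(toy_contigs)
```

In the toy example the second contig aligns first and on the reverse strand, so it is reverse-complemented and placed before the other contig, with six N characters for the unaligned stretch between them.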

Metagenome Sequence Assembly (CABBIO 20150629 Buenos Aires)
