Bas E. Dutilh
Bacteriófagos: Aspectos básicos y moleculares. Aplicaciones Biotecnológicas
Buenos Aires, June 29th 2015
Shotgun sequence assembly
Sequencing specs*
Method      Read length   Accuracy  Million reads  Time       Cost per M
454         100-700       99%       1              1 day      $10
Illumina    50-300        98%       3,000          1-2 days   $0.10
IonTorrent  100-400       98%       40-80          2 hours    $1
PacBio      1,000-30,000  87%       0.05           2 hours    $1
Sanger      400-1,200     99.9%     n/a            2 hours    $2,400
SOLiD       50            99.9%     1,200          1-2 weeks  $0.13
* these numbers change all the time!
Lengths of reads and genomes
• NGS technologies provide reads of 50 to at most 30,000 bp, but most genomes are much longer
Gago, Science 2009
Nucleotide codes
Code  Description            Bases    Count
A     Adenine                A        1
C     Cytosine               C        1
G     Guanine                G        1
T     Thymine                T        1
U     Uracil                 U        1
W     Weak                   A T      2
S     Strong                 C G      2
M     aMino                  A C      2
K     Keto                   G T      2
R     puRine                 A G      2
Y     pYrimidine             C T      2
B     not A (B after A)      C G T    3
D     not C (D after C)      A G T    3
H     not G (H after G)      A C T    3
V     not T (V after T/U)    A C G    3
N     aNy base (not a gap)   A C G T  4
-     Gap (no nucleotide)             0
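The ambiguity codes can be written down as a lookup table. The following is a minimal Python sketch, not part of the original slides; the names `IUPAC`, `matches` and `degeneracy` are hypothetical helpers chosen for illustration.

```python
# IUPAC nucleotide codes mapped to the concrete bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "W": "AT", "S": "CG", "M": "AC", "K": "GT", "R": "AG", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
    "-": "",  # a gap stands for no nucleotide at all
}


def matches(code: str, base: str) -> bool:
    """Return True if the concrete base is allowed by the IUPAC code."""
    return base.upper() in IUPAC[code.upper()]


def degeneracy(code: str) -> int:
    """Number of concrete bases a code can represent (0 for a gap)."""
    return len(IUPAC[code.upper()])


if __name__ == "__main__":
    print(matches("R", "A"))   # R (puRine) allows A
    print(degeneracy("N"))     # N allows all four bases
```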
Sequence File Formats
• Different file formats for different uses
• Competing formats developed in parallel
• Some easy to read, some easy to parse
Fasta
• Simplest sequence file format
• Unique identifiers!
• “Fasta wide” format has the whole sequence on one line
• Even easier to parse in a computer script
>identifier1 [optional information]
CCGATCATATGACTAGCATGCATCGATCGATCGACTAGCATTT
AGAGCTACGATCAGCACTACACGCTTTGTATGATTGGCGGCGG
CTATTATATTGGGA
>identifier2 [optional information]
GAGAGCTACGATCAGAGCTACGATCAGCACTACACGCTTTGTA
TGATTGGCCCCCTATATTGGGACACGATCAGCACTACACGCTT
TGTATGATTGGCGGCGGCTATCCGATCAT
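To illustrate how easily this format is parsed in a script, here is a minimal Python sketch. It assumes that the identifier is the first word after `>` and that identifiers must be unique; `parse_fasta` is a hypothetical helper name, not a library function.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read Fasta lines into {identifier: sequence}, joining wrapped lines."""
    records: Dict[str, str] = {}
    identifier = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("header line without identifier")
            identifier = fields[0]  # the rest is optional information
            if identifier in records:
                raise ValueError("duplicate identifier: " + identifier)
            records[identifier] = ""
        elif identifier is None:
            raise ValueError("sequence data before the first header")
        else:
            records[identifier] += line
    return records


if __name__ == "__main__":
    example = ">identifier1 [optional information]\nCCGATC\nATATGA\n>identifier2\nGAGAGC\n"
    for name, sequence in parse_fasta(example.splitlines()).items():
        print(name, len(sequence), sequence)
```

Joining the wrapped lines per record is in effect a conversion to the "Fasta wide" layout mentioned above.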
• Based on Fasta format
• Contains information about quality of each nucleotide
• Quality estimated by sequencing machine
@SRR014849.1 EIXKN4201CFU84 length=93
GGAAATGGAGTACGGATCGATTTTGTTTGGAACCGAAAGGGTC
+
hhhhhhhhhhhhhhhh7F@71,'";C?,B;?6B;:EA1EA1E%
• Four lines per sequence:
1. Identifier line starting with @
2. DNA sequence on one line
3. Second identifier line starting with + (identifier optional)
4. String of quality scores on one line
Fastq
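The four-line record layout can be read with a short loop. This is a minimal Python sketch under the assumption that every record occupies exactly four lines (no wrapped sequences); `parse_fastq` is a hypothetical name used only for illustration.

```python
from typing import Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier line, sequence, quality string) per four-line record."""
    stream = (line.rstrip("\r\n") for line in lines)
    for header in stream:
        if not header:
            continue  # tolerate blank lines between records
        try:
            sequence, plus, quality = next(stream), next(stream), next(stream)
        except StopIteration:
            raise ValueError("truncated Fastq record") from None
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed Fastq record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], sequence, quality


if __name__ == "__main__":
    demo = ["@read1 demo", "ACGT", "+", "II5!"]
    for identifier, sequence, quality in parse_fastq(demo):
        print(identifier, sequence, quality)
```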
Quality scores
• Phred 10: 10^-1 chance that the base is wrong
– 90% accuracy; 10% error rate
• Phred 20: 10^-2 chance that the base is wrong
– 99% accuracy; 1% error rate
• Phred 30: 10^-3 chance that the base is wrong
– 99.9% accuracy; 0.1% error rate
• Etcetera
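The pattern in these examples is the standard Phred definition, Q = -10 * log10(p), where p is the probability that the base call is wrong. A small Python sketch of the conversion in both directions (the function names are illustrative):

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability for a Phred score: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred score for an error probability: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        p = phred_to_error(q)
        print(f"Phred {q}: error rate {p:.4%}, accuracy {1 - p:.4%}")
```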
ASCII character codes
• Fastq quality score: Phred score + 33, converted to ASCII text
• Note: old Illumina format was different!
@SRR014849.1 EIXKN4201CFU84 length=93
GGAAATGGAGTACGGATCGATTTTGTTTGGAACCGAAAGGGTC
+
hhhhhhhhhhhhhhhh7F@71,'";C?,B;?6B;:EA1EA1E%
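Decoding a quality string is a character-by-character subtraction of the offset. The sketch below assumes the Phred+33 encoding described above; the `offset` parameter is added here so that the older Illumina encoding (Phred+64) can be decoded as well. `decode_quality` is a hypothetical helper name.

```python
from typing import List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert a Fastq quality string into a list of Phred scores."""
    return [ord(character) - offset for character in quality]


if __name__ == "__main__":
    print(decode_quality("!+5?I"))            # Phred+33 encoding
    print(decode_quality("h", offset=64))     # older Illumina Phred+64 encoding
```

Decoding with the wrong offset gives scores that are shifted by 31, so implausibly high or negative values are a hint that the encoding was guessed incorrectly.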
Quality profile of reads (figures: March 2011 and October 2011)
Random genome, random coverage
• Average depth:
– Genome size G
– Base depth B=40x
– Read length L=100 bp
– K-mer size K=25 bp
C = B * (L - K + 1) / L
• Uncovered bases:
u = G * e^(-C)
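As a worked example of these two formulas, the Python sketch below plugs in the slide's values (B=40, L=100, K=25) and assumes the usual Poisson model in which the expected number of uncovered positions is G * e^(-C). The genome size of 5 Mb is a made-up illustration, not a value from the slides.

```python
import math


def kmer_coverage(base_depth: float, read_length: int, k: int) -> float:
    """K-mer depth C = B * (L - K + 1) / L."""
    return base_depth * (read_length - k + 1) / read_length


def expected_uncovered(genome_size: float, coverage: float) -> float:
    """Expected number of uncovered positions under a Poisson model: G * e^(-C)."""
    return genome_size * math.exp(-coverage)


if __name__ == "__main__":
    c = kmer_coverage(base_depth=40, read_length=100, k=25)
    print("k-mer coverage:", c)
    # Hypothetical genome size of 5 Mb, chosen only for illustration.
    print("expected uncovered:", expected_uncovered(5_000_000, c))
```

With these numbers the k-mer depth is 30.4x, and the expected number of uncovered positions in a random genome is far below one.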
CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA
What is easier to assemble?
Random sequences, or real genomes?
Sequence assembly
Reads
Scaffold
• Order and orientation of contigs
• Sizes of the gaps between contigs (filled with NNN)
Contigs
• Consensus sequence of assembled reads
• Includes alignment of all reads
(Figure labels: horizontal coverage, depth, coverage)
Assembly of shotgun sequences
• Human Genome Project (2000)
– 1-2 kb Sanger reads
– < 10x coverage
– Low error rate
• High-throughput (meta-)genomics (now)
– Millions/billions of ~100-400 bp reads
– Mix of genomes with different coverage
– Biases and sequencing errors
• Quality drops towards the end of reads
• Homopolymers may be miscalled in 454 or Ion Torrent
Assembly strategies
• Reference-guided assembly
– Align reads to a (database of) reference genome(s)
– Cannot discover:
• Larger genomic mutations
– Insertions, deletions, rearrangements
• Distantly related species
• Most viruses
• De novo assembly
– Requires sufficient coverage x depth
– Breaks on repeats and low-coverage regions
– Algorithms
• Greedy assembly (only to illustrate)
• Overlap-layout-consensus
• De Bruijn graph
Reference-guided assembly
• Illumina sequencing of community DNA
• Same-species genome available (2.8M nt)
• Sometimes, only a minority of the reads can be
mapped/aligned
Distant reference
• Natural diversity of community
– “Species” share >94% average nucleotide identity
– Consensus = “average” of the species
Konstantinidis and Tiedje, PNAS 2004
(Figure labels: consensus, reference, genome space)
Iterative mapping and assembly
• The assembly is a better representation of the community
• Can we further approach the consensus genome by remapping the reads against this first assembly?
(Figure labels: reference, first assembly, consensus, genome space)
Dutilh et al. Bioinformatics 2009
Iteration improves assembly
• More mapped reads
• Fewer gaps
Dutilh et al. Bioinformatics 2009
De novo assembly
Reads:    AACAAGT
            CAAGTTA
Assembly: AACAAGTTA
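The two-read example can be reproduced with a suffix-prefix overlap search. This is a minimal Python sketch that only considers exact overlaps (no sequencing errors); `overlap` and `merge` are illustrative names.

```python
from typing import Optional


def overlap(a: str, b: str, min_length: int = 1) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def merge(a: str, b: str, min_length: int = 1) -> Optional[str]:
    """Join a and b on their overlap, or return None if they do not overlap."""
    n = overlap(a, b, min_length)
    if n == 0:
        return None
    return a + b[n:]


if __name__ == "__main__":
    print(overlap("AACAAGT", "CAAGTTA"))
    print(merge("AACAAGT", "CAAGTTA"))
```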
De novo assembly approaches
• Greedy approach
• Overlap-layout-consensus
• De Bruijn graphs
Greedy assembly
1. Sequences (reads)
2. Pairwise all-vs-all similarities
3. Find best matching pair
4. Collapse/assemble
• Works well for few, long reads (Sanger)
– All-vs-all calculations are expensive
– One clear best match
• Does not work for high throughput NGS datasets
– Many reads -> expensive to calculate
– Low coverage requires graph approach
(reads/contigs)
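The four greedy steps can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions: overlaps must be exact, reads are all on the same strand, and reads contained inside other reads are not handled. The all-vs-all loop makes the quadratic cost visible; `greedy_assemble` is a hypothetical name.

```python
from itertools import permutations
from typing import List


def overlap(a: str, b: str, min_length: int) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: List[str], min_overlap: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_length, best_pair = 0, None
        # Steps 2 and 3: all-vs-all comparison, keep the best matching pair.
        for a, b in permutations(contigs, 2):
            n = overlap(a, b, min_overlap)
            if n > best_length:
                best_length, best_pair = n, (a, b)
        if best_pair is None:
            break  # no overlaps left: remaining sequences stay separate contigs
        a, b = best_pair
        # Step 4: collapse the pair into one sequence.
        contigs.remove(a)
        contigs.remove(b)
        contigs.append(a + b[best_length:])
    return contigs


if __name__ == "__main__":
    print(greedy_assemble(["GCGTACC", "TACCTTA", "ATGGCGT"]))
```

Because each merge is decided locally, a repeat with a long overlap can pull together reads that do not belong next to each other.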
Repetitive sequences
• Reads A-D are from a region with two long repeats
• Greedy approach would first join A-D with the
largest overlap, and place B-C in a separate contig
• Resolving this requires a global view of all the
possibilities before joining two reads: a graph
(Figure: reads A, B, C and D overlapping a region with two repeats)
What is easier to assemble?
Random sequences, or real genomes?
Assembly as a “graph” problem
• Overlap-layout-consensus
• De Bruijn Graph
• A graph contains nodes and edges
Overlap-layout-consensus
1. Identify all overlaps between reads
– Use cutoffs: minimum overlap and percent identity
2. Make graph of overlap connections
– Nodes: reads
– Edges: overlaps
3. Find Hamiltonian path
– Path that contains every node once
– No efficient algorithm available
4. Determine consensus at each position
Genome:
CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA
Reads (labelled J, K, L, M and N in the figure):
CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAAT
CTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCT
GTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTT
CTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAAC
TTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA
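The overlap-layout steps can be sketched on toy reads as follows. This Python example is an illustration under stated assumptions: overlaps are exact (the percent identity cutoff is omitted), the consensus step is skipped because error-free reads need no voting, and the Hamiltonian path is found by brute force over all read orders, which is only feasible for a handful of reads. All function names are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Optional, Tuple


def overlap(a: str, b: str, min_length: int) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads: List[str], min_overlap: int) -> Dict[Tuple[str, str], int]:
    """Steps 1 and 2: nodes are reads, edges are overlaps above the cutoff."""
    edges = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_overlap)
        if n > 0:
            edges[(a, b)] = n
    return edges


def hamiltonian_path(reads: List[str], edges: Dict[Tuple[str, str], int]) -> Optional[List[str]]:
    """Step 3 by brute force: try every read order (factorial running time)."""
    for order in permutations(reads):
        if all((a, b) in edges for a, b in zip(order, order[1:])):
            return list(order)
    return None


def layout_to_sequence(path: List[str], edges: Dict[Tuple[str, str], int]) -> str:
    """Spell the sequence along the path, adding only the non-overlapping part."""
    sequence = path[0]
    for a, b in zip(path, path[1:]):
        sequence += b[edges[(a, b)]:]
    return sequence


if __name__ == "__main__":
    reads = ["TACCTTA", "ATGGCGT", "GCGTACC"]
    edges = overlap_graph(reads, min_overlap=3)
    path = hamiltonian_path(reads, edges)
    print(path)
    print(layout_to_sequence(path, edges))
```

Real overlap-layout-consensus assemblers use heuristics instead of an exhaustive search, since the number of read orders grows factorially.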
De Bruijn graph
1. Find every word of length k (k-mer) in every read
– K-mer should be long enough to be quite unique, but
– … short enough to not break on polymorphisms/errors
Genome:
CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA
Reads (labelled J, K, L, M and N in the figure):
CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAAT
CTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCT
GTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTT
CTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAAC
TTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA
K-mers (k=29) of the read CTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAAC:
CTTGATACTAATGCTTTTTGTAATCTTAT
TTGATACTAATGCTTTTTGTAATCTTATT
TGATACTAATGCTTTTTGTAATCTTATTG
GATACTAATGCTTTTTGTAATCTTATTGG
ATACTAATGCTTTTTGTAATCTTATTGGT
TACTAATGCTTTTTGTAATCTTATTGGTT
ACTAATGCTTTTTGTAATCTTATTGGTTG
CTAATGCTTTTTGTAATCTTATTGGTTGG
TAATGCTTTTTGTAATCTTATTGGTTGGC
AATGCTTTTTGTAATCTTATTGGTTGGCT
ATGCTTTTTGTAATCTTATTGGTTGGCTT
TGCTTTTTGTAATCTTATTGGTTGGCTTA
GCTTTTTGTAATCTTATTGGTTGGCTTAA
CTTTTTGTAATCTTATTGGTTGGCTTAAA
TTTTTGTAATCTTATTGGTTGGCTTAAAC
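Extracting the k-mers of a read is a sliding window of width k. A minimal Python sketch (the helper names are illustrative): a read of length L yields L - k + 1 k-mers, so a 43 bp read gives 15 k-mers for k=29.

```python
from collections import Counter
from typing import Iterable, List


def kmers(read: str, k: int) -> List[str]:
    """All words of length k in a read, in order of occurrence."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count how often each k-mer occurs across all reads."""
    counts: Counter = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


if __name__ == "__main__":
    print(kmers("AACAAGT", 3))
    print(count_kmers(["AACAAGT", "CAAGTTA"], 3).most_common(3))
```

K-mers that occur only once across a deeply sequenced sample are candidates for sequencing errors, which is one reason assemblers count them before building the graph.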
De Bruijn graph
2. Make graph of sequential k-mers in sequence
– Nodes: k-mers
– Edges: sequential presence of k-mers in reads
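Following the definition on this slide (nodes are k-mers, edges connect k-mers that are adjacent in a read), the graph can be built in one pass over the reads. This Python sketch stores each edge with the number of reads supporting it; `de_bruijn_edges` is a hypothetical name, and reverse complements are ignored for simplicity.

```python
from collections import Counter
from typing import Iterable, Tuple


def de_bruijn_edges(reads: Iterable[str], k: int) -> "Counter[Tuple[str, str]]":
    """Map each (k-mer, next k-mer) edge to the number of times it was seen."""
    edges: Counter = Counter()
    for read in reads:
        # A read of length L has L - k + 1 k-mers and therefore L - k edges.
        for i in range(len(read) - k):
            edges[(read[i:i + k], read[i + 1:i + k + 1])] += 1
    return edges


if __name__ == "__main__":
    for (source, target), depth in de_bruijn_edges(["AACAAGT", "CAAGTTA"], 3).items():
        print(source, "->", target, "depth", depth)
```

Unlike the overlap graph, no read-vs-read comparison is needed: two reads become connected simply because they contain the same k-mers.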
De Bruijn graph
3. Find Eulerian path
– Path that contains every edge once
– Efficient algorithm available
• In an optimal sequencing run of a repeat-less
genome, there is one path connecting all nodes
• In practice (especially in metagenomes) there are
many possible structures in the graph
• Edge width represents the number of linking
reads (depth)
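For the ideal case of an error-free, repeat-less sequence, the Eulerian path can be found with Hierholzer's algorithm, which runs in time linear in the number of edges. The Python sketch below is a toy illustration: it ignores edge depth and reverse complements, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def build_graph(reads: List[str], k: int) -> Dict[str, List[str]]:
    """Nodes are k-mers; each distinct pair of adjacent k-mers is one edge."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k):
            edges.add((read[i:i + k], read[i + 1:i + k + 1]))
    graph: Dict[str, List[str]] = defaultdict(list)
    for source, target in sorted(edges):
        graph[source].append(target)
    return dict(graph)


def eulerian_path(graph: Dict[str, List[str]]) -> List[str]:
    """Hierholzer's algorithm: visit every edge exactly once."""
    if not graph:
        return []
    balance: Dict[str, int] = defaultdict(int)  # out-degree minus in-degree
    for source, targets in graph.items():
        balance[source] += len(targets)
        for target in targets:
            balance[target] -= 1
    starts = sorted(node for node, value in balance.items() if value == 1)
    start = starts[0] if starts else min(graph)
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    path.reverse()
    if len(path) != sum(len(targets) for targets in graph.values()) + 1:
        raise ValueError("graph has no single Eulerian path")
    return path


def spell(path: List[str]) -> str:
    """First k-mer plus the last base of every following k-mer."""
    return path[0] + "".join(node[-1] for node in path[1:]) if path else ""


if __name__ == "__main__":
    graph = build_graph(["AACAAGT", "CAAGTTA"], 3)
    print(spell(eulerian_path(graph)))
```

When the graph contains repeats, several Eulerian paths may exist, so finding one path does not by itself guarantee the correct genome.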
Possible structures in De Bruijn graphs
• Cycle: path converges on itself
– Repeated region on the same contig
• Frayed rope: converge then diverge
– Repeated region on different contigs
• Bubble: paths diverge then converge
– Sequencing error in the middle of a read
– Polymorphisms
• Spur: short dead-ends
– Sequencing error at the end of a read
– Zero coverage shortly after end of repeat
What is easier to assemble?
Random sequences, or real genomes?
Examples of De Bruijn graphs
(Figure, Pell PNAS 2012: sequencing errors in a random circular sequence; panels 1%, 5%, 10%, 15%)
(Figure, Peng Bioinformatics 2011: five E. coli subspecies)
Random versus real sequences
• Biological sequences are not random
– Genes, operons, promoters, etcetera
– Biased nucleotide usage (GC content)
– Biased oligonucleotide usage (k-mers)
• Repeated sequences in (meta-)genomes
– Low-complexity regions
– Conserved protein domains
– Duplicated genes, horizontal transfers
– “Selfish” elements (e.g. transposons, prophages)
– Polymorphic repeats (haplotypes, strains)
– …etcetera
Repeats have multiple sinks/sources
16s
Salmonella has 7 rrn operons
Salmonella recombines at rrn operons
Helm and Maloy
Repeated regions
• In overlap-layout-consensus and De Bruijn graphs
reads
K-mers
Li BFG 2012
Genome versus metagenome
Genome:
• Depending on coverage
– Expect single sequence
– Contiguous sequence
– Even read depth
• Clonal sequence
• Identify sequencing errors by low coverage
• Repeats consist of duplicated genes and conserved domains
Metagenome:
• Depending on diversity
– Expect many sequences
– Fragmented sequences
– Varying read depth
• Natural microdiversity
• Sequencing errors or natural diversity?
• Repeats also include closely related strains, conserved genes, etc.
Chimerization in metagenome assembly
• Both OLC and DBG include “chimera protection”
– Break contigs at ambiguities
– Works if depth/coverage is high enough
(Figure labels: contig1, contig2, contig3, contig4, contig5)
• Assess final result with different parameters
– High versus low stringency assembly
• Chimerization is more frequent between
closely related strains
Assembly strategies
• Reference-guided assembly
– Align reads to a (database of) reference genome(s)
– Cannot discover:
• Larger genomic mutations
– Insertions, deletions, rearrangements
• Distantly related species
• Most viruses
• De novo assembly
– Requires sufficient read lengths, depth, and coverage
– Breaks on long repeats and low-coverage regions
– Algorithms
• Greedy assembly (only to illustrate)
• Overlap-layout-consensus
• De Bruijn graph
Scaffolding
• Use alignments to related genome sequences to sort and orient de novo contigs
Silva et al. Source Code Biol. Med. 2013

Metagenome Sequence Assembly (CABBIO 20150629 Buenos Aires)

  • 1. Bas E. Dutilh Bacteriófagos: Aspectos básicos y moleculares. Aplicaciones Biotecnológicas Buenos Aires, June 29th 2015 Shotgun sequence assembly
  • 2. Method Read length Accuracy Million reads Time Cost per M 454 100-700 99% 1 1 day $10 Illumina 50-300 98% 3,000 1-2 days $0.10 IonTorrent 100-400 98% 40-80 2 hours $1 PacBio 1,000-30,000 87% 0.05 2 hours $1 Sanger 400-1,200 99.9% n/a 2 hours $2,400 SOLiD 50 99.9% 1,200 1-2 weeks $0.13 Sequencing specs* * these numbers change all the time!
  • 3.
  • 4. Lengths of reads and genomes  NGS technologies provide reads of 50 to max. 30,000 bp, but most genomes are much longer Gago, Science 2009
  • 5. Nucleotide codes Description Bases A Adenine A 1 C Cytosine C G Guanine G T Thymine T U Uracil U W Weak A T 2 S Strong C G M aMino A C K Keto G T R puRine A G Y pYrimidine C T B not A (B after A) C G T 3 D not C (D after C) A G T H not G (H after G) A C T V not T (V after T/U) A C G N aNy base (not a gap) A C G T 4 - Gap (no nucleotide) 0
  • 6. Sequence File Formats • Different file formats for different uses • Competing formats developed in parallel • Some easy to read, some easy to parse
  • 7. • Simplest sequence file format • Unique identifiers! • “Fasta wide” format has the whole sequence on one line • Even easier to parse in a computer script Fasta >identifier1 [optional information] CCGATCATATGACTAGCATGCATCGATCGATCGACTAGCATTT AGAGCTACGATCAGCACTACACGCTTTGTATGATTGGCGGCGG CTATTATATTGGGA >identifier2 [optional information] GAGAGCTACGATCAGAGCTACGATCAGCACTACACGCTTTGTA TGATTGGCCCCCTATATTGGGACACGATCAGCACTACACGCTT TGTATGATTGGCGGCGGCTATCCGATCAT
  • 8. • Based on Fasta format • Contains information about quality of each nucleotide • Quality estimated by sequencing machine @SRR014849.1 EIXKN4201CFU84 length=93 GGAAATGGAGTACGGATCGATTTTGTTTGGAACCGAAAGGGTC + hhhhhhhhhhhhhhhh7F@71,'";C?,B;?6B;:EA1EA1E% • Four lines per sequence: 1. Identifier line starting with @ 2. DNA sequence on one line 3. Second identifier line starting with + (identifier optional) 4. String of quality scores on one line Fastq
  • 9. Quality scores  Phred 10: 10-1 chance that the base is wrong  90% accuracy; 10% error rate  Phred 20: 10-2 chance that the base is wrong  99% accuracy ; 1% error rate  Phred 30: 10-3 chance that the base is wrong  99.9% accuracy ; 0.1% error rate  Etcetera
  • 10. ASCII character codes  Fastq quality score: Phred score + 33, converted to ASCII text  Note: old Illumina format was different! @SRR014849.1 EIXKN4201CFU84 length=93 GGAAATGGAGTACGGATCGATTTTGTTTGGAACCGAAAGGGTC + hhhhhhhhhhhhhhhh7F@71,'";C?,B;?6B;:EA1EA1E%
  • 11. Quality profile of reads March 2011
  • 12. Quality profile of reads October 2011
  • 13. Random genome, random coverage • Average depth: – Genome size G – Base depth B=40x – Read length L=100 bp – K-mer size K=25 bp C = B * (L - K + 1) / L • Uncovered bases: u = G * eC CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA
  • 14. What is easier to assemble? Random sequences, or real genomes?
  • 15. Sequence assembly Reads Scaffold  Order and orientation of contigs  Sizes of the gaps between contigs (filled with NNN) Contigs  Consensus sequence of assembled reads  Includes alignment of all reads
  • 17. Assembly of shotgun sequences • Human genome project –1-2 kb Sanger reads –< 10x coverage –Low error rate • High-throughput (meta-)genomics – Millions/billions of ~100-400 bp reads – Mix of genomes with different coverage – Biases and sequencing errors • Quality drops towards the end of reads • Homo-polymers may be miss-called in 454 or Ion Torrent 2000 NOW
  • 18. Assembly strategies • Reference-guided assembly – Align reads to a (database) of reference genome(s) – Cannot discover: • Larger genomic mutations – Insertions, deletions, rearrangements • Distantly related species • Most viruses • De novo assembly – Requires sufficient coverage x depth – Breaks on repeats and low-coverage regions – Algorithms • Greedy assembly (only to illustrate) • Overlap-layout-consensus • De Bruijn graph
  • 19. Reference-guided assembly • Illumina sequencing of community DNA • Same-species genome available (2.8M nt) • Sometimes, only a minority of the reads can be mapped/aligned
  • 20. Distant reference • Natural diversity of community – “Species” share >94% average nucleotide identity – Consensus = “average” of the species Consensus Genome space Reference Konstantinidis and Tiedje, PNAS 2004
  • 21. • The assembly is a better representation of the community • Can we further approach the consensus genome by re- mapping the reads against this first assembly? Reference Genome space Iterative mapping and assembly First assembly Consensus Dutilh et al. Bioinformatics 2009
  • 22. Iteration improves assembly • More mapped reads • Fewer gaps Dutilh et al. Bioinformatics 2009
  • 23. De novo assembly Assembly: AACAAGTTA AACAAGT CAAGTTA
  • 24. De novo assembly approaches • Greedy approach • Overlap-layout-consensus • De Bruijn graphs
  • 25. Greedy assembly 1. Sequences (reads) 2. Pairwise all-vs-all similarities 3. Find best matching pair 4. Collapse/assemble • Works well for few, long reads (Sanger) – All-vs-all calculations are expensive – One clear best match • Does not work for high throughput NGS datasets – Many reads -> expensive to calculate – Low coverage requires graph approach (reads/contigs)
  • 26. Repetitive sequences • Reads A-D are from a region with two long repeats • Greedy approach would first join A-D with the largest overlap, and place B-C in a separate contig • Resolving this requires a global view of all the possibilities before joining two reads: a graph repeat repeat DA C B D C BA B D
  • 27. What is easier to assemble? Random sequences, or real genomes?
  • 28. Assembly as a “graph” problem • Overlap-layout-consensus • De Bruijn Graph • A graph contains nodes and edges node edge
  • 29. 1. Identify all overlaps between reads – Use cutoffs: minimum overlap and percent identity 2. Make graph of overlap connections – Nodes: reads – Edges: overlaps 3. Find Hamiltonian path – Path that contains every node once – No efficient algorithm available 4. Determine consensus at each position TTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA Overlap-layout-consensus CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAAT CTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCT GTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTT CTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAAC K N LJ M K NLJ M CTAGTATTATTGCTGCTCATAAAGTAGCTCCAGCTCATCTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAACCTAAAAGAGTTGAAGTTAA
  • 30. De Bruijn graph 1. Find every word of length k (k-mer) in every read – K-mer should be long enough to be quite unique, but – … short enough to not break on polymorphisms/errors [Figure: the example reads J to N aligned to the consensus; the 43-base read CTTGATACTAATGCTTTTTGTAATCTTATTGGTTGGCTTAAAC is decomposed into its 15 overlapping 29-mers, each shifted by one base]
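Step 1 is a sliding window over each read. A minimal sketch, with the hypothetical helper names `kmers` and `count_kmers`:

```python
from collections import Counter


def kmers(read, k):
    """All words of length k in a read, in order, each shifted by one base."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads, k):
    """How often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


print(kmers("AACAAGT", 3))
print(count_kmers(["AACAAGT", "CAAGTTA"], 4).most_common(2))
```

A read of length L yields L - k + 1 k-mers, so reads shorter than k contribute nothing. K-mers that occur only once across a deeply sequenced dataset are candidates for sequencing errors, which is one reason the choice of k matters.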
  • 31. De Bruijn graph 2. Make graph of sequential k-mers in sequence – Nodes: k-mers – Edges: sequential presence of k-mers in reads [Figure: the 29-mers of the example reads J to N linked in their order of appearance along the consensus]
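Following the definition on this slide (nodes are k-mers, edges connect k-mers that occur consecutively in a read), the graph can be sketched as a dictionary of weighted edges. The function name `de_bruijn_graph` and the use of the edge count as a stand-in for read depth are assumptions made for illustration.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each edge (kmer, next_kmer) to the number of times it was seen."""
    edges = defaultdict(int)
    for read in reads:
        # Consecutive k-mers in a read overlap by k - 1 bases.
        for i in range(len(read) - k):
            edges[(read[i:i + k], read[i + 1:i + k + 1])] += 1
    return dict(edges)


graph = de_bruijn_graph(["AACAAGT", "CAAGTTA"], k=4)
for (left, right), depth in graph.items():
    print(left, "->", right, "depth", depth)
```

Unlike the overlap graph, no pairwise comparison of reads is needed: each read is processed once, and reads that share a k-mer automatically meet at the same node.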
  • 32. De Bruijn graph 3. Find Eulerian path – Path that contains every edge once – Efficient algorithm available • In an optimal sequencing run of a repeat-less genome, there is one path connecting all nodes • In practice (especially in metagenomes) there are many possible structures in the graph • Edge width represents the number of linking reads (depth) [Figure: De Bruijn graph of the example consensus sequence]
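One efficient algorithm for step 3 is Hierholzer's algorithm, which runs in time linear in the number of edges. The sketch below is an assumption about how the step could be implemented, not the method of any specific assembler; `eulerian_path` and `spell` are hypothetical names, and the input is a list of distinct directed edges between k-mers.

```python
from collections import defaultdict


def eulerian_path(edges):
    """Hierholzer's algorithm: a path that uses every directed edge once."""
    if not edges:
        return []
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start where one more edge leaves than enters; otherwise any edge start.
    start = next((n for n, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    path.reverse()
    if len(path) != len(edges) + 1:
        raise ValueError("graph has no Eulerian path")
    return path


def spell(path):
    """Consecutive k-mers overlap by k - 1, so append one base per step."""
    return path[0] + "".join(node[-1] for node in path[1:])


edges = [("CAAG", "AAGT"), ("AACA", "ACAA"), ("AGTT", "GTTA"),
         ("ACAA", "CAAG"), ("AAGT", "AGTT")]
print(spell(eulerian_path(edges)))
```

For the 4-mer graph of the reads AACAAGT and CAAGTTA this spells `AACAAGTTA`. When repeats create several valid Eulerian paths, the function returns just one of them, which is where the graph structures on the next slide come in.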
  • 33. Possible structures in De Bruijn graphs • Cycle: path converges on itself – Repeated region on the same contig • Frayed rope: converge then diverge – Repeated region on different contigs • Bubble: paths diverge then converge – Sequencing error in the middle of a read – Polymorphisms • Spur: short dead-ends – Sequencing error at the end of a read – Zero coverage shortly after end of repeat
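As an illustration of how one of these structures can be recognised, the sketch below looks for spurs: short dead-end branches that split off at a fork. It is a simplified assumption of how tip detection might work. It only considers branches that end without successors, uses a hypothetical length cutoff `max_len`, and ignores read depth, which real tools would normally also take into account.

```python
from collections import defaultdict


def find_spurs(edges, max_len=2):
    """Return dead-end branches of at most `max_len` nodes that leave a fork."""
    succ, pred = defaultdict(set), defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
    nodes = set(succ) | set(pred)
    spurs = []
    for end in sorted(nodes):
        if succ.get(end):
            continue  # has successors, so not a dead end
        branch, node = [end], end
        # Walk backwards along the unbranched path leading to the dead end.
        while len(pred.get(node, ())) == 1 and len(branch) <= max_len:
            parent = next(iter(pred[node]))
            if len(succ[parent]) > 1:
                spurs.append(branch[::-1])  # parent is the fork
                break
            branch.append(parent)
            node = parent
    return spurs


# Main path A -> B -> C -> D -> E with a one-node spur X hanging off B.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("B", "X")]
print(find_spurs(edges))
```

Here only `X` is reported: the branch ending in `E` is longer than the cutoff, so it is treated as the real end of the contig rather than as an error.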
  • 34. What is easier to assemble? Random sequences, or real genomes?
  • 35. Examples of De Bruijn graphs • Sequencing errors (1%, 5%, 10%, 15%) in a random circular sequence (Pell PNAS 2012) • Five E. coli subspecies (Peng Bioinformatics 2011)
  • 36. Random versus real sequences • Biological sequences are not random – Genes, operons, promoters, etcetera – Biased nucleotide usage (GC content) – Biased oligonucleotide usage (k-mers) • Repeated sequences in (meta-)genomes – Low-complexity regions – Conserved protein domains – Duplicated genes, horizontal transfers – “Selfish” elements (e.g. transposons, prophages) – Polymorphic repeats (haplotypes, strains) – …etcetera
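The two usage biases named above are simple to measure. A minimal sketch, with the hypothetical helper names `gc_content` and `kmer_frequencies`; ambiguity codes such as N are excluded from the GC denominator by assumption.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def kmer_frequencies(seq, k):
    """Relative frequency of every k-mer that occurs in the sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


print(gc_content("AACAAGTTA"))
print(kmer_frequencies("AACAAGTTA", 2))
```

In a uniformly random sequence the GC content would approach 0.5 and all k-mers would be roughly equally frequent. Real genomes deviate from both expectations.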
  • 37. Repeats have multiple sinks/sources
  • 38. Repeats have multiple sinks/sources • Example: 16S rRNA • Salmonella has 7 rrn operons • Salmonella recombines at rrn operons (Helm and Maloy)
  • 39. Repeated regions • In overlap-layout-consensus graphs (nodes are reads) and De Bruijn graphs (nodes are k-mers) Li BFG 2012
  • 40. Genome versus metagenome • Genome: depending on coverage – Expect single sequence – Contiguous sequence – Even read depth – Clonal sequence – Identify sequencing errors by low coverage – Repeats consist of duplicated genes and conserved domains • Metagenome: depending on diversity – Expect many sequences – Fragmented sequences – Varying read depth – Natural microdiversity – Sequencing errors or natural diversity? – Repeats also include closely related strains, conserved genes, etc.
  • 41. Chimerization in metagenome assembly • Both OLC and DBG include “chimera protection” – Break contigs at ambiguities – Works if depth/coverage is high enough • Assess final result with different parameters – High versus low stringency assembly • Chimerization is more frequent between closely related strains [Figure: graph of contig1 to contig5]
  • 42. Assembly strategies • Reference-guided assembly – Align reads to a (database of) reference genome(s) – Cannot discover: • Larger genomic mutations – Insertions, deletions, rearrangements • Distantly related species • Most viruses • De novo assembly – Requires sufficient read lengths, depth, and coverage – Breaks on long repeats and low-coverage regions – Algorithms • Greedy assembly (only to illustrate) • Overlap-layout-consensus • De Bruijn graph
  • 43. Scaffolding • Use alignments to a related genome sequence to sort and orient de novo contigs Silva et al. Source Code Biol. Med. 2013
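Once each contig has been aligned to the related genome, sorting and orienting is mostly bookkeeping. The sketch below assumes the alignments have already been computed by some external tool and summarised as a hypothetical dictionary `hits` that maps each contig name to its start position on the reference and its strand. It does not reflect the input format of any specific scaffolding program.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T in either case)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def scaffold(contigs, hits):
    """Order contigs by reference position and flip those on the minus strand.

    contigs: dict of contig name -> sequence
    hits:    dict of contig name -> (reference start, "+" or "-")
    Contigs without a hit are left out of the scaffold.
    """
    ordered = sorted(hits, key=lambda name: hits[name][0])
    return [
        (name,
         contigs[name] if hits[name][1] == "+"
         else reverse_complement(contigs[name]))
        for name in ordered
    ]


contigs = {"contig1": "AACA", "contig2": "TTGC", "contig3": "GGGG"}
hits = {"contig1": (100, "+"), "contig2": (10, "-")}
print(scaffold(contigs, hits))
```

In this example `contig2` comes first and is reverse-complemented, while `contig3` has no alignment and is left unplaced. Gaps between neighbouring contigs could then be estimated from the reference coordinates.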