Genome Assembly
MICROBIO 590B Bioinformatics Lab: Bacterial Genomics
Professor Kristen DeAngelis
UMass Amherst
Fall 2022
Lecture Learning Goals
• Define metagenome assembled genomes (MAGs), and explain how
sequence assembly can help to reveal novel microbial diversity.
• Define the hallmarks of good sequence assembly, including coverage,
read length, quality and others.
• Distinguish between sequence reads, contigs, and scaffolds.
• Describe the conceptual idea behind how de Bruijn graphs are
employed in sequence assembly.
Molecular approaches aka ‘Omics
• Amplicon-based sequencing
• limited capacity for discovery
• Better for phylogenetics
because traits can be aligned
• Shotgun sequencing
• Lots of room for discovery
• About half of sequences have no
known homologs
Detecting unculturable bacteria
• Genomics – sequencing whole genomes, DNA
• Transcriptomics – sequencing RNA from genomes
• Proteomics – sequencing proteins
• Metabolomics – identification of all metabolites, usually by
analytical chemistry
• Meta-
• Add the prefix “meta-” to any of the above, and it refers to sequencing
mixed communities instead of single species
• For example, Metagenomics – sequencing mixed communities
• For example, Metatranscriptomics – sequencing RNA from mixed
communities
Bacterial phyla with few to no cultivated representatives
• The 13 “main” phyla were described by Carl Woese in his classic 1987
paper
• Since then, sequencing revealed that microbial diversity is much
greater than this!
• Most cultivated, characterized bacteria fall into one of five phyla
• Proteobacteria, Firmicutes, Actinobacteria, Bacteroidetes and Cyanobacteria
• Animal diversity is similar: most animal species belong to a few animal phyla
e.g. nematodes (round worms) and arthropods (insects)
Assembling whole genomes from metagenomic data
using binning
• Short sequences are “binned”
based on shared characteristics
• GC content
• taxonomy
• Coverage (sequence depth)
• K-mer frequency
http://dx.doi.org/10.3389/fmicb.2015.01451
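Two of the binning characteristics listed above, GC content and k-mer frequency, can be computed directly from a contig sequence. The following is a minimal sketch, not the method of any particular binning tool; the function names and the two example contigs are invented for illustration, and k = 4 (tetranucleotides) is an assumed default.

```python
from collections import Counter
from itertools import product


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_frequencies(seq: str, k: int = 4) -> dict:
    """Relative frequency of every possible k-mer over A/C/G/T."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Only count k-mers made of unambiguous bases (skips e.g. N).
    total = sum(counts[kmer] for kmer in all_kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


# Hypothetical contigs, invented for illustration only.
contigs = {
    "contig_1": "ATATATTTAAATATTA",
    "contig_2": "GCGCGGCCGCGGGCCG",
}
for name, seq in contigs.items():
    print(name, "GC fraction:", round(gc_content(seq), 2))
```

Contigs with similar GC content and k-mer profiles (and similar coverage) are candidates for the same bin; real binners combine these signals statistically rather than comparing them by eye.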
Genome Assembly
• Millions or billions of pieces
• Many malformed pieces → errors
• Missing pieces → coverage
• Pieces mixed from another puzzle → contamination
• Lots of identical blue sky → repetitive regions
Assembling a book from word fragments ...
Genomes are assembled from DNA fragments
Sequence reads → Contigs → Scaffolds
Methods of Genome Assembly: Graph-based methods
Methods of Genome Assembly: de Bruijn graphs
• Make one node for every distinct word or phrase
• For every pair of adjacent words in the phrases, add a corresponding
directed edge
• Graph can have more than one edge pointing from one node to
another (multigraph)
• Example phrase: “tomorrow and tomorrow and tomorrow”
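The graph-building rules above can be sketched in a few lines. The first function applies them to the word example; the second carries the same idea over to DNA, where each k-mer contributes an edge from its (k-1)-base prefix to its (k-1)-base suffix. This is an illustrative sketch, not the code of any real assembler; the function names, the toy sequence and the choice k = 3 are assumptions.

```python
from collections import Counter


def word_graph(phrase: str) -> Counter:
    """Count directed edges between adjacent words.

    Nodes are the distinct words; repeated edges are kept as counts,
    which is what makes the graph a multigraph.
    """
    words = phrase.split()
    return Counter(zip(words, words[1:]))


def de_bruijn_edges(seq: str, k: int) -> Counter:
    """Edges of a de Bruijn graph built from the k-mers of one sequence."""
    edges = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Each k-mer links its prefix node to its suffix node.
        edges[(kmer[:-1], kmer[1:])] += 1
    return edges


edges = word_graph("tomorrow and tomorrow and tomorrow")
for (src, dst), n in edges.items():
    print(f"{src} -> {dst} (x{n})")

# Toy DNA sequence, invented for illustration only.
dna_edges = de_bruijn_edges("ATGATG", k=3)
```

For the phrase, the graph has only two nodes ("tomorrow" and "and") joined by repeated edges in both directions. Reconstructing the original text then amounts to finding a walk that uses every edge exactly once.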
Dealing with Double-Strandedness in DNA Sequences
What about palindromic sequences?
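One common way to handle both strands is to treat a k-mer and its reverse complement as the same node by storing only a "canonical" form. The sketch below assumes the canonical form is the lexicographically smaller of the two, which is a widely used convention but not necessarily the one shown in lecture; all function names are illustrative.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Lexicographically smaller of a k-mer and its reverse complement."""
    kmer = kmer.upper()
    return min(kmer, reverse_complement(kmer))


def is_palindromic(seq: str) -> bool:
    """True if the sequence equals its own reverse complement."""
    return seq.upper() == reverse_complement(seq)


print(reverse_complement("ATGC"))
print(canonical("GCAT"), canonical("ATGC"))
print(is_palindromic("GAATTC"), is_palindromic("ATGC"))
```

A palindromic k-mer is its own reverse complement, so both strands collapse onto a single node. An odd-length k-mer can never be palindromic, because its middle base would have to be its own complement; this is one reason often given for choosing an odd k.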
Generating overlapping reads in Illumina Sequencing
Long-read, third generation sequencing technology
What makes a good assembly?
Coverage
• Low coverage
• Beginning of rain fall
• Random distribution across
landscape
• Some areas remain dry
• High coverage
• Middle/end of storm
• Rain covers landscape
• Some puddles get deep
Coverage
• Coverage (or depth) in DNA sequencing is the number of unique
reads that include a given nucleotide in the reconstructed sequence
• Average coverage per genome (C) is the number of reads (N) multiplied
by the read length (L), divided by the genome size (G): C = N × L / G
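The average-coverage calculation (reads × read length ÷ genome size) is simple enough to check by hand or in code. A minimal sketch follows; the function names and the example numbers (one million 150 bp reads, a 5 Mbp genome) are assumptions chosen for illustration, not values from the lecture.

```python
import math


def average_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average coverage C = N * L / G."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical run: 1,000,000 reads of 150 bp against a 5 Mbp genome.
coverage = average_coverage(1_000_000, 150, 5_000_000)
print(f"Average coverage: {coverage}X")
print("Reads needed for 30X:", reads_needed(30, 150, 5_000_000))
```

This is an average only: individual positions will be covered more or less deeply than C.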
1X coverage genome sequence
John, Gibbons, via
2X coverage genome sequence
4X coverage genome sequence
8X coverage genome sequence
Poisson distribution of genome coverage
At 20X coverage, only 45% of
the genome is covered >20X
At 40X coverage, 95% of the
genome is covered >30X
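These percentages can be approximated by treating per-base depth as Poisson-distributed with the average coverage as its mean. The sketch below rests on that assumption (reads landing uniformly at random, which real libraries only approximate); the function names are illustrative.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """P(depth == k) for a Poisson distribution, computed in log space."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def fraction_covered_more_than(depth: int, mean_coverage: float) -> float:
    """Expected fraction of bases with depth strictly greater than `depth`."""
    at_or_below = sum(poisson_pmf(k, mean_coverage) for k in range(depth + 1))
    return 1.0 - at_or_below


for mean_coverage, depth in [(20, 20), (40, 30)]:
    frac = fraction_covered_more_than(depth, mean_coverage)
    print(f"At {mean_coverage}X average, {frac:.0%} of bases exceed {depth}X")

# Expected fraction of bases with no reads at all at 1X average coverage.
uncovered_at_1x = poisson_pmf(0, 1.0)
```

Under this model the results land close to the figures quoted above, and more than a third of the genome is expected to be missed entirely at 1X average coverage.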
Measures of Genome Assembly Quality
• Expectations
• Bacteria tend to have all genes on one circular chromosome
• Bacterial genomes tend to be between 1 and 10 Mbp
• Presence of certain single-copy ‘housekeeping’ marker genes
• Measures
• Contigs: few and long
• Contamination: the occurrence of more than one copy of a marker gene in a given
genome or bin, since marker genes are typically single-copy
• Completeness: the proportion of expected marker genes that are present, i.e. one
minus the ratio of missing marker genes to the total number of markers used
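The two marker-gene measures can be illustrated with a simple count. This is a deliberately simplified sketch: it treats every marker independently, which is not how dedicated tools score genomes, and the marker names and copy numbers are invented for illustration.

```python
def completeness_and_contamination(marker_counts: dict) -> tuple:
    """Return (completeness %, contamination %) from marker copy counts.

    marker_counts maps each expected single-copy marker gene to the
    number of copies found in the genome or bin.
    """
    total = len(marker_counts)
    if total == 0:
        raise ValueError("need at least one marker gene")
    present = sum(1 for copies in marker_counts.values() if copies >= 1)
    # Every copy beyond the first counts as evidence of contamination.
    extra = sum(copies - 1 for copies in marker_counts.values() if copies > 1)
    return 100 * present / total, 100 * extra / total


# Hypothetical bin: four markers, one missing and one duplicated.
example_bin = {"rpoB": 1, "gyrB": 1, "recA": 2, "dnaK": 0}
completeness, contamination = completeness_and_contamination(example_bin)
print(f"Completeness: {completeness}%  Contamination: {contamination}%")
```

In this toy bin, three of four markers are present (75% complete) and one extra copy is found across four markers (25% contamination).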
Measures of Genome Assembly Quality
• N50, N75
• N50 is the length for which the collection of all contigs of that length or longer
covers at least half (50%) of the total base content of the assembly.
• It serves as a length-weighted median for assessing whether the assembly is balanced
towards longer contigs (higher N50) or shorter contigs (lower N50).
• N75 is used for the same purpose, but the length is set at 75% of total base
content instead of 50%.
https://hoytpr.github.io/bioinformatics-semester/materials/genomics-assembly-reporting/
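The N50 and N75 definitions translate into a short calculation: sort contigs from longest to shortest and add up lengths until the chosen fraction of the assembly is reached. A minimal sketch, with an assumed function name and invented contig lengths:

```python
def nx(contig_lengths, fraction=0.5):
    """Length of the contig at which the running total, taken from the
    longest contig down, first reaches `fraction` of the assembly size.
    fraction=0.5 gives N50; fraction=0.75 gives N75.
    """
    lengths = sorted(contig_lengths, reverse=True)
    threshold = fraction * sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running >= threshold:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical assembly of seven contigs, 300 units in total.
lengths = [80, 70, 50, 40, 30, 20, 10]
print("N50:", nx(lengths))
print("N75:", nx(lengths, 0.75))
```

Here the two longest contigs (80 + 70 = 150) already hold half of the 300 bases, so N50 is 70; reaching 75% takes the four longest contigs, so N75 is 40.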
Measures of Genome Assembly Quality
• L50, L75:
• L50 is the number of contigs equal to or longer than the N50 length. In other
words, L50 is the minimal number of contigs that contain half the total base
content of the assembly.
• L75 is used for the same purpose in reference to the N75 length.
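L50 and L75 count contigs rather than report a length. The sketch below uses the "minimal number of contigs" form of the definition; the function name and the contig lengths are invented for illustration.

```python
def lx(contig_lengths, fraction=0.5):
    """Minimal number of contigs, taken from the longest down, whose
    combined length reaches `fraction` of the assembly size.
    fraction=0.5 gives L50; fraction=0.75 gives L75.
    """
    ordered = sorted(contig_lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return count
    raise ValueError("contig_lengths must not be empty")


# Hypothetical assembly of seven contigs, 300 units in total.
example_lengths = [80, 70, 50, 40, 30, 20, 10]
print("L50:", lx(example_lengths))
print("L75:", lx(example_lengths, 0.75))
```

For a fixed assembly size, a better assembly tends to have a small L50: fewer, longer contigs are needed to cover half of the bases.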
Measures of Genome Assembly Quality
• Completeness – % of essential single-copy genes present
• Contamination – % duplication of essential single-copy genes
Parks et al., Genome Research. 2015
Bowers et al. Nat. Biotech, 2017