Sequence Alignment
MICROBIO 590B Bioinformatics Lab: Bacterial Genomics
Professor Kristen DeAngelis
UMass Amherst
Fall 2022
Lecture Learning Goals
• Explain homology and how sequence homology is related to function
or physiology across groups of organisms.
• Describe global sequence alignment, and contrast global versus local
sequence alignment.
• Describe how to find the best alignment, including the measures that
allow one to quantify alignment scores.
• Needleman-Wunsch
• Smith-Waterman
• Basic Local Alignment via BLAST
• Perform a statistical analysis of alignments.
The Central Dogma of Biology
• …an example of the insights we
can gain from sequence alignment!
Crick, Nature 1970
RNA was the first biological molecule …
… with DNA evolving last
The replication machinery of LUCA: common origin of DNA
replication and transcription
• The origin of DNA replication is an enigma because the replicative
DNA polymerases (DNAPs) are not homologous among the three
domains of life: Bacteria, Archaea, and Eukarya.
• In the RNA-protein world that predated the advent of DNA
replication, RNA polymerases (RNAPs) and replicative DNAPs may
have evolved from a common ancestor that functioned as an
RNA-dependent RNA polymerase.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7281927/
Phylogenies describe how organisms are related,
and more closely related organisms share more traits.
• Blue distances describe evolutionary distance. The amount of change
is given by a scale bar drawn in the same orientation.
• Red lines are spacers for labels, pictures, or data.
• Yellow circles are hypothetical common ancestors.
Homology implies shared ancestry
• Homologous characters – characters in different organisms that are
similar because they were inherited from a common ancestor that
also had the character
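Homology implies shared ancestry
• Analogous characters are the opposite of homologous characters:
they are similar in different organisms because they evolved
independently, not because they were inherited from a common ancestor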
Homologous genes
have a shared ancestry.
• Orthologs arise from a
speciation event –
multiple organisms, one
gene.
• Paralogs arise from a
duplication event – the
same organism, two
different homologous
genes.
The Tree of Life
• Organisms with more homologous molecular traits also
have more shared physiological traits
Woesian ToL: Pace NR, Science 1997
Sequence homology is used to infer shared ancestry
• A conservative mutation encodes an amino acid replacement with similar biochemistry
• A semi-conservative mutation encodes an amino acid replacement with somewhat different
biochemistry, e.g., a different charge
• A non-conservative mutation encodes an amino acid replacement with different biochemistry
Sequence alignment
• The task of finding corresponding parts in two related sequences
• Biological Significance
• Prediction of gene/protein function
• Gene Finding
• Species identification (database searching)
• Evolutionary relationships (sequence divergence)
• Sequence Assembly
Global sequence alignment
• A representation of the correspondence between all the respective
symbols (i.e. nucleotides or amino acids) of two sequences
• If two sequences (s,t) have the same ancestor, we expect them to
have many symbols and strings in common
• For most symbols, we should be able to identify the corresponding
homologous position in the other sequence
• Represented as an “alignment”
• Mutations appear as mismatches
• Insertions/Deletions appear as gaps
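The representation described above can be sketched in a few lines of Python. This is only an illustration: the two aligned rows are hypothetical, and the function name `summarize_alignment` is an assumption, not something from the lecture. It classifies each column of an alignment as a match, a mismatch (mutation), or a gap (insertion/deletion).

```python
def summarize_alignment(row_s, row_t, gap="-"):
    """Count match, mismatch, and gap columns in a pairwise alignment."""
    if len(row_s) != len(row_t):
        raise ValueError("aligned rows must have the same length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for a, b in zip(row_s, row_t):
        if a == gap or b == gap:
            counts["gap"] += 1        # insertion or deletion
        elif a == b:
            counts["match"] += 1      # identical symbols
        else:
            counts["mismatch"] += 1   # substitution (mutation)
    return counts


# Hypothetical aligned rows for two short sequences s and t.
aligned_s = "GATTACA-"
aligned_t = "GCT-ACAT"
summary = summarize_alignment(aligned_s, aligned_t)
print(summary)
```

Both rows have the same length because every symbol in one sequence is paired either with a symbol or with a gap in the other.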
Alignment score function
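A minimal sketch of one possible alignment score function follows. The scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration, not the values used in the lecture; the score of an alignment is simply the sum of its column scores.

```python
def alignment_score(row_s, row_t, match=1, mismatch=-1, gap=-2):
    """Sum per-column scores for two aligned rows of equal length."""
    if len(row_s) != len(row_t):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_s, row_t):
        if a == "-" or b == "-":
            score += gap        # penalty for an insertion/deletion
        elif a == b:
            score += match      # reward for identical symbols
        else:
            score += mismatch   # penalty for a substitution
    return score


# Hypothetical alignment: 5 matches, 1 mismatch, 2 gaps.
example_score = alignment_score("GATTACA-", "GCT-ACAT")
print(example_score)
```

With these assumed values the example alignment scores 5(1) + 1(-1) + 2(-2) = 0; a different scoring scheme would rank the same alignment differently.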
A substitution matrix can represent scoring
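As a sketch of this idea, a substitution matrix can be stored as a lookup table keyed by the two aligned symbols. The helper names `build_matrix` and `score_with_matrix` and the scores themselves are assumptions for illustration only.

```python
def build_matrix(alphabet="ACGT", match=1, mismatch=-1):
    """Build a symmetric substitution matrix as a nested dictionary."""
    return {a: {b: (match if a == b else mismatch) for b in alphabet}
            for a in alphabet}


def score_with_matrix(row_s, row_t, matrix, gap=-2):
    """Score an alignment by looking up each column in the matrix."""
    total = 0
    for a, b in zip(row_s, row_t):
        if a == "-" or b == "-":
            total += gap            # gaps are scored outside the matrix
        else:
            total += matrix[a][b]   # substitution score for this column
    return total


dna_matrix = build_matrix()
matrix_score = score_with_matrix("GATTACA-", "GCT-ACAT", dna_matrix)
print(dna_matrix["A"]["G"], matrix_score)
```

Because the scores live in a table, any entry can be changed independently, which is what allows more detailed substitution models to be plugged in without touching the scoring loop.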
Nucleotide substitution models
• The Jukes-Cantor model is a one-parameter model: every
substitution is treated as equally likely
• Two-parameter models only care about whether a substitution
is a transition or transversion
• Six-parameter models weight each change differently
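To make the one- and two-parameter models concrete, here is a hedged sketch that computes the standard Jukes-Cantor distance, d = -(3/4) ln(1 - 4p/3), and the Kimura two-parameter distance, d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)), where p is the proportion of differing sites, P the proportion of transitions, and Q the proportion of transversions. The two example sequences are hypothetical, and the code assumes gap-free input containing only A, C, G, and T.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_differences(s, t):
    """Return (sites, transitions, transversions) for aligned, gap-free DNA."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    transitions = transversions = 0
    for a, b in zip(s, t):
        if a == b:
            continue
        # Transition: purine <-> purine or pyrimidine <-> pyrimidine.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return len(s), transitions, transversions


def jukes_cantor_distance(p):
    """One-parameter distance from the proportion of differing sites."""
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p_distance(p_transitions, q_transversions):
    """Two-parameter distance from transition and transversion proportions."""
    return -0.5 * math.log((1 - 2 * p_transitions - q_transversions)
                           * math.sqrt(1 - 2 * q_transversions))


# Hypothetical sequences: one transition (A>G) and one transversion (C>A).
counts = count_differences("ACGTACGTAC", "ACGTGCGTAA")
sites, ts, tv = counts
jc = jukes_cantor_distance((ts + tv) / sites)
k2p = kimura_2p_distance(ts / sites, tv / sites)
print(counts, round(jc, 4), round(k2p, 4))
```

Both distances are larger than the raw proportion of differences (0.2 here) because they correct for multiple substitutions at the same site.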
Nucleotide substitution matrices
• Transitions are much more common
than transversions, so these are
weighted differently in deciding
what distance to assign to a
mismatch
• Six-parameter models consider
different types of transitions and
transversions, weighting each
change differently
• Gaps are also tricky… for example,
adjacent gaps are not independent: a single insertion
or deletion event can span several positions
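One common way to reflect the relationship between adjacent gaps is an affine gap penalty, in which opening a gap costs more than extending it. The sketch below is an assumption-laden illustration: the penalty values are hypothetical, and the convention used here (the first gap position pays `gap_open`, each further position pays `gap_extend`) is only one of several in use.

```python
def affine_gap_penalty(aligned_row, gap_open=-5, gap_extend=-1, gap="-"):
    """Sum affine penalties over every run of gap characters in one row."""
    total = 0
    in_gap = False
    for ch in aligned_row:
        if ch == gap:
            # First position of a run pays the opening cost,
            # later positions pay only the extension cost.
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            in_gap = False
    return total


one_long_gap = affine_gap_penalty("AC---GT")      # one run of three gaps
three_short_gaps = affine_gap_penalty("A-C-G-T")  # three runs of one gap
print(one_long_gap, three_short_gaps)
```

Under these assumed values, one three-position gap is penalized far less than three separate single-position gaps, matching the idea that adjacent gaps usually come from one event.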
Amino Acid Substitution Models are more Complex
• Protein substitution matrices are
significantly more complex than
DNA scoring matrices
• Proteins are composed of 20
amino acids with varying
physicochemical properties
• A protein substitution matrix can
be based on any property of
amino acids: size, polarity,
charge, hydrophobicity, etc.
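As a toy illustration of a property-based matrix, the sketch below scores amino acid pairs by how similar their Kyte-Doolittle hydropathy values are. The scoring rule (4 minus the absolute hydropathy difference) and the function name `hydropathy_score` are invented for this example; widely used protein matrices are derived differently, so treat this only as a demonstration of the idea.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_score(a, b, identity_score=4.0):
    """Toy score: highest for identical residues, lower as hydropathy diverges."""
    return identity_score - abs(HYDROPATHY[a] - HYDROPATHY[b])


# Build the full 20 x 20 toy matrix as a nested dictionary.
toy_matrix = {a: {b: hydropathy_score(a, b) for b in HYDROPATHY}
              for a in HYDROPATHY}

similar = toy_matrix["I"]["V"]     # both strongly hydrophobic
dissimilar = toy_matrix["I"]["R"]  # hydrophobic versus charged
print(round(similar, 1), round(dissimilar, 1))
```

Swapping isoleucine for valine scores close to an identity, while swapping it for arginine is heavily penalized, which is the behavior a conservative versus non-conservative distinction calls for.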
GC content can introduce bias, but is NOT an
indication of poor-quality sequence.
• Biased sequence composition of
reads (e.g., GC content, 5' bias in
transcriptomes)
• Causes
• Codon or noncoding bias
• Random hexamer priming
(transcriptome studies)
• Library preparation (PCR bias)
• Problems
• Variant calling (SNPs)
• Copy Number Variation (CNV)
• Solutions
• Weighting sequence reads based
on their first 7 nucleotides
• GC content correction
• RNA-SeQC, CQN, EDASeq
How do we find the best alignment?
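The learning goals name Needleman-Wunsch as the method for finding the best global alignment. The following is a minimal sketch of that dynamic programming approach, assuming a simple linear scoring scheme (match +1, mismatch -1, gap -2) that is not taken from the lecture. It fills a score matrix and then traces back from the bottom-right cell to recover one optimal alignment.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    n, m = len(s), len(t)
    # score[i][j] holds the best score for aligning s[:i] with t[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two symbols
                              score[i - 1][j] + gap,      # gap in t
                              score[i][j - 1] + gap)      # gap in s
    # Trace back from the bottom-right corner to the top-left corner.
    row_s, row_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_s.append(s[i - 1])
            row_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_s.append(s[i - 1])
            row_t.append("-")
            i -= 1
        else:
            row_s.append("-")
            row_t.append(t[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_s)), "".join(reversed(row_t))


nw_result = needleman_wunsch("ACGT", "AGT")
print(nw_result)
```

When several alignments tie for the best score, this sketch returns only one of them; which one depends on the order in which the traceback checks the three moves.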
Smith-Waterman
• Like Needleman-Wunsch (N-W), a dynamic programming
algorithm
• Guaranteed to find the optimal local
alignment with respect to the scoring
system being used
• Main difference from N-W is that
negative-scoring cells are set to zero, to
highlight local alignments
• Not practical for big alignment
problems…
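The difference from N-W can be seen in a short sketch. The scoring values (match +2, mismatch -1, gap -2) and the example sequences are assumptions for illustration. Cells are floored at zero, and the traceback starts at the highest-scoring cell and stops at the first zero.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_s, aligned_t) for one optimal local alignment."""
    n, m = len(s), len(t)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            # The extra 0 resets any negative-scoring cell.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero-scoring cell is reached.
    row_s, row_t = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            row_s.append(s[i - 1])
            row_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_s.append(s[i - 1])
            row_t.append("-")
            i -= 1
        else:
            row_s.append("-")
            row_t.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(row_s)), "".join(reversed(row_t))


# Hypothetical sequences that share only the internal region GATTACA.
sw_result = smith_waterman("TTTTGATTACATTTT", "CCCGATTACACCC")
print(sw_result)
```

The unrelated flanking regions are ignored rather than forced into the alignment, which is the point of a local method. The score matrix still has one cell per pair of positions, so the cost grows with the product of the two sequence lengths.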
Sequence Databases are Enormous!
• GenBank: ~10 trillion nucleotides total!
(Sayers et al., Nucleic Acids Research. 2020)
• NCBI SRA: raw sequence data from high-throughput sequencing
• 1 Gb = 1 billion bases; 1 terabase = 1,000 Gb
• GenBank: 9,890,500,490,859 nt (about 9.9 trillion)
• SRA total: 54,966,126,472,268,601 nt (about 55 quadrillion)
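A rough calculation shows why exhaustive dynamic programming does not scale to databases of this size. The sketch below uses the nucleotide totals quoted on this slide; the 1,000-nt query length is a hypothetical value chosen for illustration.

```python
GENBANK_NT = 9_890_500_490_859       # GenBank total from the slide
SRA_NT = 54_966_126_472_268_601      # SRA total from the slide


def dp_cells(query_length, database_nt):
    """Number of score-matrix cells needed to align a query to every base."""
    return query_length * database_nt


def in_terabases(nt):
    """Convert nucleotides to terabases (1 Tb = 10**12 bases)."""
    return nt / 1e12


def in_petabases(nt):
    """Convert nucleotides to petabases (1 Pb = 10**15 bases)."""
    return nt / 1e15


query_length = 1_000  # hypothetical query
cells = dp_cells(query_length, GENBANK_NT)
print(f"GenBank: {in_terabases(GENBANK_NT):.1f} Tb")
print(f"SRA:     {in_petabases(SRA_NT):.1f} Pb")
print(f"Cells for a {query_length}-nt query against GenBank: {cells:.2e}")
```

Nearly ten quadrillion cells for a single modest query is the motivation for heuristic search tools that avoid filling the whole matrix.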
Basic Local Alignment Search Tool (BLAST)
BLAST Output and Values
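Among the values BLAST reports for each hit are a bit score and an E-value. A hedged sketch of how the two relate under Karlin-Altschul statistics follows: the bit score is S' = (λS - ln K) / ln 2 and the expected number of chance hits is E = m · n · 2^(-S'), where m and n are the query and database lengths. The λ and K values below are hypothetical placeholders, since the real values depend on the scoring system, and BLAST itself uses adjusted effective lengths that this sketch ignores.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score using the parameters lambda and K."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_length, database_length):
    """Expected number of chance alignments scoring at least this well."""
    return query_length * database_length * 2 ** (-bits)


# Hypothetical placeholder parameters and search sizes.
LAMBDA, K = 0.3, 0.1
m, n = 1_000, 9_890_500_490_859

bits = bit_score(150, LAMBDA, K)
expected_hits = e_value(bits, m, n)
print(round(bits, 1), expected_hits)
```

Two properties follow directly from the formula: each additional bit halves the E-value, and the same alignment becomes less significant as the database grows.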
Lecture Learning Goals
• Explain homology and how sequence homology is related to function
or physiology across groups of organisms.
• Describe global sequence alignment, and contrast global versus local
sequence alignment.
• Describe how to find the best alignment, including the measures that
allow one to quantify alignment scores.
• Needleman-Wunsch
• Smith-Waterman
• Basic Local Alignment via BLAST
• Perform a statistical analysis of alignments.