Introduction to sequence alignment
Outline
- Introduction and definitions
- The need for sequence alignment
- Classification of sequence alignments
- The alignment problem and the complexity of alignment
Sequence Alignment
- Probably the most common “experiment” done in biology today
- Formally considered an experiment because you don’t know what you’ll get until you perform the operation
- As an experiment, it is based on a hypothesis, it uses a reproducible technique, and it generates results that lead to conclusions or more experiments
Fact:
Sequence comparisons
lie at the heart of all
bioinformatics
Sequence Alignment
Sequence alignment is the assignment of residue-residue correspondences. It involves:
- precise operators for alignment: matching, gaps
- a quantitative scoring system for matches and gaps
- a systematic search among possible alignments
- the use of alignment algorithms to find the optimal alignment
Algorithms
- An algorithm is a sequence of instructions that one must perform in order to solve a well-formulated problem
- First you must identify exactly what the problem is!
- A problem describes a class of computational tasks. A problem instance is one particular input from that class
Similarity versus Homology*
- Similarity refers to the likeness or % identity between 2 sequences
- Similarity means sharing a statistically significant number of bases or amino acids
- Similarity does not imply homology
- Homology refers to shared ancestry
- Two sequences are homologous if they are derived from a common ancestral sequence
- Homology usually implies similarity
Similarity versus Homology*
- Similarity can be quantified
- It is correct to say that two sequences are X% identical
- It is correct to say that two sequences have a similarity score of Z
- It is generally incorrect to say that two sequences are X% similar
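The "X% identical" figure can be computed directly from a pairwise alignment. The following is a minimal sketch, assuming gaps are written as "-" and that gap columns are left out of the denominator (one of several conventions in use; the slides do not say which one they intend). The function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the aligned columns in which neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # assumption: gap columns are not counted
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


print(percent_identity("ACGT", "ACGA"))
```

Counting gap columns in the denominator instead would give a lower figure for the same alignment, which is one reason a bare percentage should always be reported together with the convention used.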
Homologues & All That*
- Homologue (or Homolog)
  - Protein/gene that shares a common ancestor with another and has good sequence and/or structure similarity to it (general term)
  - Homology: genes that derive from a common ancestor; these genes are called homologs
- Paralogue (or Paralog)
  - A homologue which arose through gene duplication in the same species/chromosome
  - Paralogous genes are homologous genes in one organism that derive from gene duplication
  - Gene duplication: one gene is duplicated in multiple copies that are therefore free to evolve and assume new functions
- Orthologue (or Ortholog)
  - A homologue which arose through speciation (found in different species)
  - Orthologous genes are homologous genes in different organisms
Mutations
- Causes for sequence (dis)similarity
  - substitution (point mutation): a nucleotide at a certain location is replaced by another nucleotide (e.g.: ATA → AGA)
  - insertion: at a certain location one new nucleotide is inserted in between two existing nucleotides (e.g.: AA → AGA)
  - deletion: at a certain location one existing nucleotide is deleted (e.g.: ACTG → AC-G)
  - indel: an insertion or a deletion
Importance: Alignments tell us about...*
- Function or activity of a new gene/protein
- Structure or shape of a new protein
- Location or preferred location of a protein
- Stability of a gene or protein
- Origin of a gene or protein
- Origin or phylogeny of an organelle
- Origin or phylogeny of an organism
Sequence Complexity*
MCDEFGHIKLAN…. High Complexity
ACTGTCACTGAT…. Mid Complexity
NNNNTTTTTNNN…. Low Complexity
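The slide does not say how complexity is measured. One common proxy is the Shannon entropy of the symbol frequencies, and the sketch below uses it purely as an assumption to show that the three example strings rank in the order given. The function name is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy, in bits per symbol, of the symbol frequencies in seq."""
    if not seq:
        return 0.0
    total = len(seq)
    counts = Counter(seq)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


for example in ("MCDEFGHIKLAN", "ACTGTCACTGAT", "NNNNTTTTTNNN"):
    print(example, round(shannon_entropy(example), 2))
```

Entropy only looks at composition, not at the order of the symbols, so it is a rough measure; search tools typically apply dedicated low-complexity filters instead.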
Assessing Sequence Similarity
Rbn KETAAAKFERQHMD
Lsz KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNT
Rbn SST SAASSSNYCNQMMKSRNLTKDRCKPMNTFVHESLA
Lsz QATNRNTDGSTDYGILQINSRWWCNDGRTP GSRN
Rbn DVQAVCSQKNVACKNGQTNCYQSYSTMSITDCRETGSSKY
Lsz LCNIPCSALLSSDITASVNC AKKIVSDGDGMNAWVAWR
Rbn PNACYKTTQANKHIIVACEGNPYVPHFDASV
Lsz NRCKGTDVQA WIRGCRL
Is this alignment significant?
Is This Alignment
Significant?
Gelsolin 89 L G N E L S Q D E S G A A A I F T V Q L 108
Annexin 82 L P S A L K S A L S G H L E T V I L G L 101
154 L E K D I I S D T S G D F R K L M V A L 173
240 L E – S I K K E V K G D L E N A F L N L 258
314 L Y Y Y I Q Q D T K G D Y Q K A L L Y L 333
Consensus L x P x x x P D x S G x h x x h x V L L
Some Simple Rules**
- If two sequences are > 100 residues long and > 25% identical, they are likely related
- If two sequences are 15-25% identical they may be related, but more tests are needed
- If two sequences are < 15% identical they are probably not related
- If you need more than 1 gap for every 20 residues the alignment is suspicious
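These rules of thumb can be written down as a small decision function. This is only a sketch of the rules as stated on the slide: the function name and return strings are hypothetical, and the case of a short alignment (100 residues or fewer) with more than 25% identity is not covered by the rules, so it is reported as such.

```python
def rule_of_thumb(length: int, pct_identity: float, n_gaps: int):
    """Return (verdict, suspicious) for an alignment of `length` residues."""
    if pct_identity > 25 and length > 100:
        verdict = "likely related"
    elif pct_identity > 25:
        verdict = "not covered by the rules (100 residues or fewer)"
    elif pct_identity >= 15:
        verdict = "may be related, more tests needed"
    else:
        verdict = "probably not related"
    # More than 1 gap per 20 residues makes the alignment suspicious.
    suspicious = n_gaps * 20 > length
    return verdict, suspicious


print(rule_of_thumb(200, 30.0, 5))
```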
Classifications of sequence alignments
a) Global/local sequence alignment
b) Pairwise/multiple sequence alignment
Global/local sequence alignment
1. Global alignment
- Input: treat the two sequences as potentially equivalent
- Goal: identify conserved regions and differences
- Algorithm: Needleman-Wunsch dynamic programming
- Applications:
- Comparing two genes with same function (in human vs.
mouse).
- Comparing two proteins with similar function.
Q: How similar are two sequences S1 and S2?
Input: two sequences S1, S2 over the same alphabet
Output: two sequences S’1, S’2 of equal length
(S’1, S’2 are S1, S2 with possibly additional gaps)
Example:
- S1= GCGCATGGATTGAGCGA
- S2= TGCGCCATTGATGACC
- A possible alignment:
S’1= -GCGC-ATGGATTGAGCGA
S’2= TGCGCCATTGAT-GACC--
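A global alignment of this kind can be produced with Needleman-Wunsch dynamic programming. The sketch below is a minimal version under assumed scores (match +1, mismatch -1, and a linear gap penalty of -1 per gap character); the slides do not specify a scoring scheme, and the alignment it returns for the example need not be the one shown above.

```python
def needleman_wunsch(s1, s2, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_s1, aligned_s2) for a global alignment."""
    n, m = len(s1), len(s2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and row: a prefix aligned entirely against gaps.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in s2
                              score[i][j - 1] + gap)      # gap in s1
    # Traceback from the bottom-right corner to the top-left corner.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s1[i - 1] == s2[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return score[n][m], "".join(reversed(a1)), "".join(reversed(a2))


total, row1, row2 = needleman_wunsch("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACC")
print(total)
print(row1)
print(row2)
```

Several alignments can share the optimal score; which one the traceback returns depends on the order in which the three moves are tried.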
Global/local sequence alignment
2. Local alignment
- Input: The two sequences may or may not be related
- Goal: see whether a substring in one sequence aligns well with a
substring in the other
- Algorithm: Smith-Waterman dynamic programming
- Note: for local matching, overhangs at the ends are not treated as
gaps
- Applications:
- Searching for local similarities in large sequences
(e.g., newly sequenced genomes)
-Looking for conserved domains or motifs in two proteins
Q: Find the pair of substrings in two input sequences which have the
highest similarity
Input: two sequences S1, S2 over the same alphabet
Output: two sequences S’1, S’2 of equal length
(S’1, S’2 are substrings of S1, S2 with possibly additional gaps)
Example:
- S1= GCGCATGGATTGAGCGA
- S2= TGCGCCATTGATGACC
- A possible alignment:
S’1= ATTGA-G
S’2= ATTGATG
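The local variant differs from global alignment in two ways: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The following is a minimal Smith-Waterman sketch with assumed scores (match +2, mismatch -1, linear gap -1); the slides do not give a scoring scheme, so the substrings it reports for the example may differ from the ones shown above.

```python
def smith_waterman(s1, s2, match=2, mismatch=-1, gap=-1):
    """Return (score, aligned_sub1, aligned_sub2) for the best local alignment."""
    n, m = len(s1), len(s2)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            h[i][j] = max(0,                      # start a new local alignment
                          h[i - 1][j - 1] + sub,
                          h[i - 1][j] + gap,
                          h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Traceback from the best cell until a zero-scoring cell is reached.
    a1, a2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if s1[i - 1] == s2[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


print(smith_waterman("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACC"))
```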
Global vs. Local Alignments
- Global alignment algorithms align two sequences end to end: they start at the beginning of both sequences and add gaps to each as needed until the ends are reached.
- Local alignment algorithms find the region (or regions) of highest similarity between two sequences and build the alignment outward from there.
Global/local sequence alignment
3. Semi-global alignment
- Input: two sequences, one short and one long
- Goal: is the short one a part of the long one?
- Algorithm: modification of Smith-Waterman
- Applications:
- Given a DNA fragment (with possible error), look for it in the genome
- Look for a well-known domain in a newly-sequenced protein.
4. Suffix-prefix alignment
- Input: two sequences (usually DNA)
- Goal: is the prefix of one the suffix of the other?
- Algorithm: modification of Smith-Waterman.
- Applications:
- DNA fragment assembly
5. Heuristic alignment
- Input: two sequences
- Goal: See if two sequences are "similar" or candidates for alignment
- Algorithms: BLAST, FASTA (and others)
- Applications:
- Search in large databases
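The suffix-prefix question from item 4 has a simple exact-match special case that shows the idea behind fragment assembly. The sketch below is a deliberate simplification: it tolerates no sequencing errors, whereas the dynamic-programming variant named on the slide does. The function name and the `min_len` parameter are hypothetical.

```python
def longest_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that exactly equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0  # no overlap of at least min_len characters


left, right = "ACGTTGCA", "TGCAGGT"
k = longest_overlap(left, right)
print(k, left + right[k:])  # overlap length and the merged fragment
```

The `min_len` threshold guards against merging fragments on very short overlaps that are likely to occur by chance.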
Database search methods: Sequence Alignment
The most widely used local similarity algorithms are:
Smith-Waterman (http://www.ebi.ac.uk/MPsrch/)
Basic Local Alignment Search Tool (BLAST, http://www.ncbi.nih.gov)
Fast Alignment (FASTA, http://fasta.genome.jp; http://www.ebi.ac.uk/fasta33/;
http://www.arabidopsis.org/cgi-bin/fasta/nph-TAIRfasta.pl)
Which algorithm to use for database similarity search?
Speed: BLAST > FASTA > Smith-Waterman (Smith-Waterman is VERY SLOW and
uses a LOT OF COMPUTER POWER)
FASTA is more sensitive than BLAST and misses fewer homologues
Smith-Waterman is even more sensitive
BLAST calculates probabilities
FASTA is more accurate than BLAST for DNA-DNA searches
Pairwise/multiple sequence alignment
Multiple sequence alignment (MSA) can be seen as a generalization
of Pairwise Sequence Alignment - instead of aligning two sequences,
n sequences are aligned simultaneously, where n is > 2
Definition:
A multiple sequence alignment is an alignment of n > 2 sequences obtained
by inserting gaps (“-”) into the sequences such that the resulting sequences all
have length L and can be arranged in a matrix of n rows and L columns, where
each column represents a homologous position
Note: MSA applies both to nucleotide and amino acid sequences
To construct a multiple alignment, one may have to introduce gaps in sequences
at positions where there were no gaps in the corresponding pairwise alignment
→ Multiple alignments typically contain more gaps than any given pair of
aligned sequences
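The matrix view in the definition can be made concrete: every row must have the same length L, and reading the matrix column by column gives the positions treated as homologous. A minimal sketch follows; the function name is hypothetical and the four rows are the small primate example used later in these slides.

```python
def msa_columns(rows):
    """Check that all rows have equal length and return the alignment columns."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length L")
    return ["".join(column) for column in zip(*rows)]


alignment = ["N-YLS", "NKYLS", "N-F-S", "N-FLS"]
for index, column in enumerate(msa_columns(alignment), start=1):
    print(index, column)
```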
Multiple sequence alignment (MSA)
Pairwise sequence alignment
A pairwise sequence alignment is an alignment of 2 sequences
obtained by inserting gaps (“-”) such that the resulting sequences
have the same length and where each pair of residues represents a
homologous position
Keyword search vs. alignment
Keyword search
- keyword search is exact matching
- can be done quickly (straightforward scan)
- used in Entrez (for example)
Alignment
- non-exact, scored matching
- takes much more time
- used in tools like BLAST2, CLUSTALW
Why do we need (multiple) sequence alignment?
Multiple sequence alignment can help to develop a sequence “finger print” which allows the
identification of members of distantly related protein family (motifs)
Formulate & test hypotheses about protein 3-D structure
MSA can help us to reveal biological facts about proteins
(e.g. how protein function has changed, or the evolutionary pressure acting on a gene)
Crucial for genome sequencing:
-Random fragments of a large molecule are sequenced and those that overlap are
found by a multiple sequence alignment program.
- Sequence may be from one strand of DNA or the other, so complements of each
sequence must also be compared
- Sequence fragments will usually overlap, but by an unknown amount and in
some cases, one sequence may be included within another
- All of the overlapping pairs of sequence fragments must be assembled into large
composite genome sequence
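Because a fragment may come from either strand, an assembler compares each fragment and its reverse complement against the others. A minimal sketch of the reverse-complement step for plain A/C/G/T sequences is given below; handling of ambiguity codes such as N is left out, and the function name is hypothetical.

```python
# Translation table for the four standard bases, upper and lower case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


fragment = "ATGC"
print(fragment, reverse_complement(fragment))
```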
To establish homology for phylogenetic analyses
Identify primers and probes to search for homologous sequences in other organisms
The alignment problem
It is not self-evident how these
sequences are to be aligned together.
Here are some possibilities:

Taxon A AGAC
Taxon B --AC
Taxon C AG--

Taxon A AGAC
Taxon C AG--
Taxon B --AC

Taxon B AC--
Taxon C AG--
Taxon A AGAC

Taxon B --AC
Taxon C --AG
Taxon A AGAC
How do we generate a multiple alignment? Given a pairwise alignment, just
add the third, then the fourth, and so on, until all have been aligned. Does it
work?
Example:
Taxon A AGAC
Taxon B --AC
Taxon A AGAC
Taxon C AG--
Taxon B AC
Taxon C AG
It depends not only on the various alignment parameters but also on the order in
which sequences are added to the multiple alignment
The alignment problem
What happens when a sequence alignment is wrong?
[Figure: three trees, with tips ordered A B C, A C B, and B C A]
A: AGT
B: AT
C: ATC

A: AGT
B: A-T
C: ATC

A: AGT
B: AT-
C: ATC

A: AGT-
B: A-T-
C: A-TC
From pairwise to multiple alignments
In pairwise alignments, one has a two-dimensional matrix
with the sequences on each axis. The number of operations
required to locate the best “path” through the matrix is
approximately proportional to the product of the lengths of the
two sequences
A possible general method would be to extend the pairwise
alignment method into a simultaneous N-wise alignment, using
a complete dynamical-programming algorithm in N dimensions.
Algorithmically, this is not difficult to do
But what about execution time?
Algorithm Complexity ‘The big-O notation’
One of the most important properties of an algorithm is how its
execution time increases as the problem is made larger (e.g. more
sequences to align).
This is the so-called algorithmic (or computational) complexity of the
algorithm
There is a notation to describe the algorithmic complexity, called the
big-O notation.
If we have a problem size (number of input data points) n, then an
algorithm takes O(n) time if the time increases linearly with n. If the
algorithm needs time proportional to the square of n, then it is O(n^2)
It is important to realize that an algorithm that is quick on small
problems may be totally useless on large problems if it has a bad O()
behavior. As a rule of thumb one can use the following
characterizations, where n is the size of the problem, and c is a
constant:
The big-O notation
•To compute an N-wise alignment, the algorithmic
complexity is something like O(c^(2n)),
where c is a constant, and n is the number of
sequences
Example:
If a pairwise alignment of two sequences [O(c^(2×2))] takes 1 second,
then four sequences [O(c^(2×4))] would take 10^4 seconds (2.8
hours), five sequences [O(c^(2×5))] 10^6 seconds (11.6 days), six
sequences [O(c^(2×6))] 10^8 seconds (3.2 years), seven sequences
[O(c^(2×7))] 10^10 seconds (317 years), and so on
This is disastrous!
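The figures in this example can be reproduced with a few lines of arithmetic. The sketch assumes that each additional sequence multiplies the running time by c squared, that c = 10 (a value inferred from the numbers on the slide, not stated there), and that the pairwise case is scaled to 1 second. The function name is hypothetical.

```python
def estimated_seconds(n_sequences: int, c: int = 10) -> int:
    """Running time in seconds, scaled so that a pairwise alignment takes 1 second."""
    if n_sequences < 2:
        raise ValueError("need at least two sequences")
    # Each sequence beyond the second multiplies the time by c**2.
    return c ** (2 * (n_sequences - 2))


for n in range(2, 8):
    seconds = estimated_seconds(n)
    print(f"{n} sequences: {seconds} s = {seconds / 3600:.1f} h = {seconds / 86400:.1f} days")
```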
How to optimize alignment algorithms?
Use structural information:
- reading frame
- protein structure
-Sequence elements are not truly independent but
related by phylogeny
[Figure: phylogenetic tree of Human, Chimp, Gorilla and Orangutan, with internal nodes labelled NK/-YLS, NFL/-S and NK/-Y/FL/-S]
Raw
Human NYLS
Chimp NKYLS
Gorilla NFS
Orangutan NFLS
Alignment
Human N-YLS
Chimp NKYLS
Gorilla N-F-S
Orangutan N-FLS
How to optimize alignment algorithms?
Sequences often contain highly conserved regions
These regions can be used for an initial alignment
By analyzing a number of small, independent fragments,
the algorithmic complexity can be drastically reduced!
“Optimal” vs. “correct” alignment
For a given group of sequences, there is no single “correct”
alignment, only an alignment that is “optimal” according to some
set of calculations
This is partly due to:
- the complexity of the problem,
- limitations of the scoring systems used,
- our limited understanding of life and evolution
Success of the alignment will depend on the similarity of the
sequences. If sequence variation is great it will be very difficult to
find an optimal alignment
Sequence alignment and gaps
Gaps can occur:
- Before the first character of a string
- Inside a string
- After the last character of a string
CTGCGGG---GGTAAT
  ||||    || ||
--GCGG-AGAGG-AA-
Note: In protein-coding nucleotide sequences most gaps have a length of 3N
Gap Penalties
In the MSA scoring scheme, a penalty is subtracted for each gap introduced
into an alignment, because each gap adds uncertainty to the alignment
The gap penalty is used to help decide whether or not to accept a gap or
insertion in an alignment
Biologically, it should in general be easier for a sequence to accept a different
residue in a position than to have parts of the sequence chopped away
or inserted. Gaps/insertions should therefore be rarer than point
mutations (substitutions)
In general, the lower the gap penalties, the more gaps and the more
identities are found, but the result should be judged against its biological
significance
Most MSA programs allow for an adjustment of gap penalties
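The effect of the gap penalty can be seen by scoring one fixed alignment under two settings. The sketch below assumes a simple linear scheme (the same penalty for every gap character) with match +1 and mismatch -1; these values are illustrative, not taken from the slides, and real programs often use separate gap-opening and gap-extension penalties. The function name is hypothetical.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Sum of column scores for two aligned rows under a linear gap penalty."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap       # column containing a gap
        elif x == y:
            total += match     # identical residues
        else:
            total += mismatch  # substitution
    return total


top = "CTGCGGG---GGTAAT"
bottom = "--GCGG-AGAGG-AA-"
print(score_alignment(top, bottom, gap=-2))
print(score_alignment(top, bottom, gap=-1))
```

With the stricter penalty this gap-rich alignment scores well below zero, so an aligner would prefer an alternative with fewer gaps; with the milder penalty the same alignment becomes much more competitive.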

Introduction to sequence alignment

  • 2. Outline 2  Introduction-Definitions  The need for sequence alignment  Classification of sequence alignments  The alignment problem-Complexity of alignment
  • 3. Sequence Alignment  Probably the most common “experiment” done in biology today  Formally considered an experiment because you don’t know what you’ll get until you perform the operation  As an experiment, it is based on a hypothesis; it uses a reproducible technique and it generates results that lead to conclusions or more experiments
  • 4. Fact: Sequence comparisons lie at the heart of all bioinformatics
  • 5. Sequence Alignment Sequence alignment is the assignment of residue- residue correspondences: It involves: •- precise operators for alignment: matching, gaps •- quantitative scoring system for matches and gaps •- systematic search among possible alignments •- use alignment algorithms to find optimal alignment
  • 6. Algorithms  An algorithm is a sequence of instructions that one must perform in order to solve a well-formulated problem  First you must identify exactly what the problem is!  A problem describes a class of computational tasks. A problem for instance is one particular input from that task
  • 7. Similarity versus Homology*  Similarity refers to the likeness or % identity between 2 sequences  Similarity means sharing a statistically significant number of bases or amino acids  Similarity does not imply homology  Homology refers to shared ancestry  Two sequences are homologous if they are derived from a common ancestral sequence  Homology usually implies similarity
  • 8. Similarity versus Homology*  Similarity can be quantified  It is correct to say that two sequences are X% identical  It is correct to say that two sequences have a similarity score of Z  It is generally incorrect to say that two sequences are X% similar
  • 9. Homologues & All That*  Homologue (or Homolog)  Protein/gene that shares a common ancestor and which has good sequence and/or structure similarity to another (general term)  Homology: genes that derive from a common ancestor- these gene are called homologs  Paralogue (or Paralog)  A homologue which arose through gene duplication in the same species/chromosome  Paralogous genes are homologous genes in one organism that derive from gene duplication  Gene duplication: one gene is duplicated in multiple copies that are therefore free to evolve and assume new functions  Orthologue (or Ortholog)  A homologue which arose through speciation (found in different species)  Orthologous genes are homologous genes in different organisms
  • 10. Mutations  Causes for sequence (dis)similarity  mutation: a nucleotide at a certain location is replaced by another nucleotide (e.g.: ATA → AGA)  insertion: at a certain location one new nucleotide is inserted in between two existing nucleotides (e.g.: AA → AGA)  deletion: at a certain location one existing nucleotide is deleted (e.g.: ACTG → AC-G)  indel: an insertion or a deletion
  • 11. Importance: Alignments tell us about...*  Function or activity of a new gene/protein  Structure or shape of a new protein  Location or preferred location of a protein  Stability of a gene or protein  Origin of a gene or protein  Origin or phylogeny of an organelle  Origin or phylogeny of an organism
  • 12. Sequence Complexity* MCDEFGHIKLAN…. High Complexity ACTGTCACTGAT…. Mid Complexity NNNNTTTTTNNN…. Low Complexity
  • 13. Assessing Sequence Similarity Rbn KETAAAKFERQHMD Lsz KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNT Rbn SST SAASSSNYCNQMMKSRNLTKDRCKPMNTFVHESLA Lsz QATNRNTDGSTDYGILQINSRWWCNDGRTP GSRN Rbn DVQAVCSQKNVACKNGQTNCYQSYSTMSITDCRETGSSKY Lsz LCNIPCSALLSSDITASVNC AKKIVSDGDGMNAWVAWR Rbn PNACYKTTQANKHIIVACEGNPYVPHFDASV Lsz NRCKGTDVQA WIRGCRL is this alignment significant?
  • 14. Is This Alignment Significant? Gelsolin 89 L G N E L S Q D E S G A A A I F T V Q L 108 Annexin 82 L P S A L K S A L S G H L E T V I L G L 101 154 L E K D I I S D T S G D F R K L M V A L 173 240 L E – S I K K E V K G D L E N A F L N L 258 314 L Y Y Y I Q Q D T K G D Y Q K A L L Y L 333 Consensus L x P x x x P D x S G x h x x h x V L L
  • 15. Some Simple Rules**  If two sequence are > 100 residues and > 25% identical, they are likely related  If two sequences are 15-25% identical they may be related, but more tests are needed  If two sequences are < 15% identical they are probably not related  If you need more than 1 gap for every 20 residues the alignment is suspicious
  • 16. Classifications of sequence alignments a) Global/local sequence alignment b) Pairwise/multiple sequence alignment
  • 17. Global/local sequence alignment 1. Global alignment - Input: treat the two sequences as potentially equivalent - Goal: identify conserved regions and differences - Algorithm: Needleman-Wunsch dynamic programming - Applications: - Comparing two genes with same function (in human vs. mouse). - Comparing two proteins with similar function. Q: How similar are two sequences S1 and S2 Input: two sequences S1, S2 over the same alphabet Output: two sequences S’1, S’2 of equal length (S’1, S’2 are S1, S2 with possibly additional gaps) Example:  S1= GCGCATGGATTGAGCGA  S2= TGCGCCATTGATGACC  A possible alignment: S’1= -GCGC-ATGGATTGAGCGA S’2= TGCGCCATTGAT-GACC--
  • 18. Global/local sequence alignment 2. Local alignment - Input: The two sequences may or may not be related - Goal: see whether a substring in one sequence aligns well with a substring in the other - Algorithm: Smith-Waterman dynamic programming - Note: for local matching, overhangs at the ends are not treated as gaps - Applications: - Searching for local similarities in large sequences (e.g., newly sequenced genomes) -Looking for conserved domains or motifs in two proteins Q: Find the pair of substrings in two input sequences which have the highest similarity Input: two sequences S1, S2 over the same alphabet Output: two sequences S’1, S’2 of equal length (S’1, S’2 are substrings of S1, S2 with possibly additional gaps) Example:  S1= GCGCATGGATTGAGCGA  S2= TGCGCCATTGATGACC  A possible alignment: S’1= ATTGA-G S’2= ATTGATG
  • 19. Global vs. Local Alignments
      - Global alignment algorithms start at the beginning of the two sequences and add gaps to each until the end of one is reached.
      - Local alignment algorithms find the region (or regions) of highest similarity between two sequences and build the alignment outward from there.
  • 20. Global/local sequence alignment
      3. Semi-global alignment
      - Input: two sequences, one short and one long
      - Goal: is the short one a part of the long one?
      - Algorithm: modification of Smith-Waterman
      - Applications:
        - Given a DNA fragment (with possible error), look for it in the genome
        - Look for a well-known domain in a newly-sequenced protein
      4. Suffix-prefix alignment
      - Input: two sequences (usually DNA)
      - Goal: is the prefix of one the suffix of the other?
      - Algorithm: modification of Smith-Waterman
      - Applications:
        - DNA fragment assembly
      5. Heuristic alignment
      - Input: two sequences
      - Goal: see if two sequences are "similar" or candidates for alignment
      - Algorithms: BLAST, FASTA (and others)
      - Applications:
        - Search in large databases
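One common way to get semi-global behavior is to change only the boundary conditions of the dynamic-programming table: skipping a prefix or suffix of the long sequence is free, while every residue of the short sequence must be accounted for. The sketch below follows that idea; the function name, the scoring values, and the choice to return only the score and end position (no traceback) are assumptions made to keep it short.

```python
def fit_short_in_long(short: str, long: str, match: int = 1,
                      mismatch: int = -1, gap: int = -1) -> tuple[int, int]:
    """Best score for placing `short` inside `long`, plus the end position.

    The end position is the number of characters of `long` consumed when
    the best placement finishes (1-based index of the last aligned column).
    """
    n, m = len(short), len(long)
    prev = [0] * (m + 1)  # row 0 is all zeros: a skipped prefix of `long` is free
    for i in range(1, n + 1):
        cur = [i * gap] + [0] * m  # unmatched residues of `short` are penalized
        for j in range(1, m + 1):
            sub = match if short[i - 1] == long[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub,  # align two residues
                         prev[j] + gap,      # gap in `long`
                         cur[j - 1] + gap)   # gap in `short`
        prev = cur
    # Taking the maximum over the last row makes the suffix of `long` free too.
    best_end = max(range(m + 1), key=lambda j: prev[j])
    return prev[best_end], best_end


if __name__ == "__main__":
    print(fit_short_in_long("GATT", "ACGATTACA"))
```

A suffix-prefix (overlap) alignment can be obtained with a similar change of boundary conditions, making the leading gap free in one sequence and the trailing gap free in the other.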
  • 21. Database search methods: Sequence Alignment The most widely used local similarity algorithms are: Smith-Waterman (http://www.ebi.ac.uk/MPsrch/) Basic Local Alignment Search Tool (BLAST, http://www.ncbi.nih.gov) Fast Alignment (FASTA, http://fasta.genome.jp; http://www.ebi.ac.uk/fasta33/; http://www.arabidopsis.org/cgi-bin/fasta/nph-TAIRfasta.pl)
  • 22.
  • 23. Which algorithm to use for database similarity search?
      - Speed: BLAST > FASTA > Smith-Waterman (Smith-Waterman is VERY SLOW and uses a LOT OF COMPUTER POWER)
      - FASTA is more sensitive than BLAST and misses fewer homologues
      - Smith-Waterman is even more sensitive
      - BLAST calculates probabilities
      - FASTA is more accurate than BLAST for DNA-DNA searches
  • 24. Pairwise/multiple sequence alignment
      Pairwise sequence alignment: A pairwise sequence alignment is an alignment of 2 sequences obtained by inserting gaps (“-”) such that the resulting sequences have the same length and each pair of residues represents a homologous position.
      Multiple sequence alignment (MSA): MSA can be seen as a generalization of pairwise sequence alignment. Instead of aligning two sequences, n sequences are aligned simultaneously, where n > 2.
      Definition: A multiple sequence alignment is an alignment of n > 2 sequences obtained by inserting gaps (“-”) into the sequences such that the resulting sequences all have length L and can be arranged in a matrix of n rows and L columns, where each column represents a homologous position.
      Note: MSA applies to both nucleotide and amino acid sequences.
      To construct a multiple alignment, one may have to introduce gaps in sequences at positions where there were no gaps in the corresponding pairwise alignment, so multiple alignments typically contain more gaps than any given pair of aligned sequences.
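The matrix view in the definition can be illustrated with a few lines of code. The helper name below is hypothetical; it only checks that all gapped rows share one length L and then exposes the alignment column by column. The example rows are the three-taxon sequences used later in the deck.

```python
def alignment_columns(rows: list[str]) -> list[tuple[str, ...]]:
    """Treat gapped rows as an n x L matrix and return its L columns."""
    if len(rows) < 2:
        raise ValueError("an alignment needs at least two sequences")
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned rows must have the same length L")
    # zip(*rows) transposes the matrix: one tuple per alignment column.
    return [tuple(column) for column in zip(*rows)]


if __name__ == "__main__":
    msa = ["AGAC",   # Taxon A
           "--AC",   # Taxon B
           "AG--"]   # Taxon C
    for index, column in enumerate(alignment_columns(msa), start=1):
        print(index, column)
```

Each column is the unit that is claimed to be homologous, so most downstream analyses (consensus, conservation, phylogeny) iterate over columns rather than rows.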
  • 25.
  • 26.
  • 27. Keyword search vs. alignment Keyword search - keyword search is exact matching - can be done quickly (straightforward scan) - used in Entrez (for example) Alignment - non-exact, scored matching - takes much more time - used in tools like BLAST2, CLUSTALW
  • 28. Why do we need (multiple) sequence alignment?
      - Multiple sequence alignment can help to develop a sequence “fingerprint” which allows the identification of members of a distantly related protein family (motifs)
      - Formulate & test hypotheses about protein 3-D structure
      - MSA can help us to reveal biological facts about proteins (e.g. how protein function has changed, or the evolutionary pressure acting on a gene)
      - Crucial for genome sequencing:
        - Random fragments of a large molecule are sequenced and those that overlap are found by a multiple sequence alignment program
        - A sequence may be from one strand of DNA or the other, so complements of each sequence must also be compared
        - Sequence fragments will usually overlap, but by an unknown amount, and in some cases one sequence may be included within another
        - All of the overlapping pairs of sequence fragments must be assembled into a large composite genome sequence
      - To establish homology for phylogenetic analyses
      - Identify primers and probes to search for homologous sequences in other organisms
  • 29. The alignment problem
      It is not self-evident how these sequences are to be aligned together. Here are some possibilities:
        Taxon A AGAC    Taxon A AGAC    Taxon B AC--    Taxon B --AC
        Taxon B --AC    Taxon C AG--    Taxon C AG--    Taxon C --AG
        Taxon C AG--    Taxon B --AC    Taxon A AGAC    Taxon A AGAC
      How do we generate a multiple alignment? Given a pairwise alignment, just add the third, then the fourth, and so on, until all have been aligned. Does it work?
      Example:
        Taxon A AGAC    Taxon A AGAC    Taxon B AC
        Taxon B --AC    Taxon C AG--    Taxon C AG
      It depends not only on the various alignment parameters but also on the order in which sequences are added to the multiple alignment.
  • 30. The alignment problem
      What happens when a sequence alignment is wrong?
      [Figure: three arrangements of the taxa: A B C, A C B, B C A]
        A: AGT    A: AGT    A: AGT    A: AGT-
        B: AT     B: A-T    B: AT-    B: A-T-
        C: ATC    C: ATC    C: ATC    C: A-TC
  • 31. From pairwise to multiple alignments In pairwise alignments, one has a two-dimensional matrix with the sequences on each axis. The number of operations required to locate the best “path” through the matrix is approximately proportional to the product of the lengths of the two sequences. A possible general method would be to extend the pairwise alignment method into a simultaneous N-wise alignment, using a complete dynamic-programming algorithm in N dimensions. Algorithmically, this is not difficult to do. But what about execution time?
  • 32. Algorithm Complexity ‘The big-O notation’ One of the most important properties of an algorithm is how its execution time increases as the problem is made larger (e.g. more sequences to align). This is the so-called algorithmic (or computational) complexity of the algorithm. There is a notation to describe the algorithmic complexity, called the big-O notation. If we have a problem size (number of input data points) n, then an algorithm takes O(n) time if the time increases linearly with n. If the algorithm needs time proportional to the square of n, then it is O(n^2). It is important to realize that an algorithm that is quick on small problems may be totally useless on large problems if it has a bad O() behavior. As a rule of thumb one can use the following characterizations, where n is the size of the problem, and c is a constant:
  • 33. The big-O notation To compute an N-wise alignment, the algorithmic complexity is something like O(c^(2n)), where c is a constant and n is the number of sequences. Example: if a pairwise alignment of two sequences [O(c^(2×2))] takes 1 second, then four sequences [O(c^(2×4))] would take 10^4 seconds (2.8 hours), five sequences [O(c^(2×5))] 10^6 seconds (11.6 days), six sequences [O(c^(2×6))] 10^8 seconds (3.2 years), seven sequences [O(c^(2×7))] 10^10 seconds (317 years), and so on. This is disastrous!
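The arithmetic in this example can be reproduced directly. The slide's figures are consistent with c = 10 (each extra sequence multiplies the time by c^2 = 100); that value is inferred from the numbers rather than stated, and the function name is hypothetical.

```python
def estimated_seconds(n_sequences: int, c: float = 10.0,
                      pairwise_seconds: float = 1.0) -> float:
    """Time for an n-wise alignment under O(c^(2n)), scaled so n = 2 takes 1 s."""
    # c^(2n) / c^(2*2) simplifies to c^(2(n - 2)).
    return pairwise_seconds * c ** (2 * (n_sequences - 2))


if __name__ == "__main__":
    seconds_per_year = 365.25 * 24 * 3600
    for n in range(2, 8):
        t = estimated_seconds(n)
        print(f"{n} sequences: {t:.0e} s = {t / 3600:.1f} h "
              f"= {t / 86400:.1f} days = {t / seconds_per_year:.1f} years")
```

The growth comes entirely from the exponent, so faster hardware only shifts the wall by one or two sequences; this is the motivation for the heuristic strategies discussed on the following slides.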
  • 34. How to optimize alignment algorithms?
      Use structural information:
      - reading frame
      - protein structure
      - Sequence elements are not truly independent but related by phylogeny
      Raw:
        Human      N Y L S
        Chimp      N K Y L S
        Gorilla    N F S
        Orangutan  N F L S
      Alignment:
        Human      N - Y L S
        Chimp      N K Y L S
        Gorilla    N - F - S
        Orangutan  N - F L S
      [Figure: phylogenetic tree of the four sequences with node labels NK/-YLS, NFL/-S, and NK/-Y/FL/-S]
  • 35.
  • 36. How to optimize alignment algorithms? Sequences often contain highly conserved regions These regions can be used for an initial alignment By analyzing a number of small, independent fragments, the algorithmic complexity can be drastically reduced!
  • 37.
  • 38. “Optimal” vs. “correct” alignment For a given group of sequences, there is no single “correct” alignment, only an alignment that is “optimal” according to some set of calculations This is partly due to: - the complexity of the problem, - limitations of the scoring systems used, - our limited understanding of life and evolution Success of the alignment will depend on the similarity of the sequences. If sequence variation is great it will be very difficult to find an optimal alignment
  • 39.
  • 40. Sequence alignment and gaps
      Gaps can occur:
      - Before the first character of a string
      - Inside a string
      - After the last character of a string
      Example (the second sequence has gaps in all three places):
        CTGCGGG---GGTAAT
          ||||    || ||
        --GCGG-AGAGG-AA-
      Note: In protein-coding nucleotide sequences most gaps have a length of 3N (a multiple of three nucleotides)
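The three gap positions can be told apart with simple string operations. The sketch below uses a hypothetical helper that counts gap characters (not gap runs) in each position for one aligned row, applied to the two rows of the slide's example.

```python
def classify_gaps(row: str) -> tuple[int, int, int]:
    """Count gap characters as (leading, internal, trailing) for one aligned row."""
    without_leading = row.lstrip("-")
    leading = len(row) - len(without_leading)
    core = without_leading.rstrip("-")
    trailing = len(without_leading) - len(core)
    internal = core.count("-")  # whatever remains lies between two residues
    return leading, internal, trailing


if __name__ == "__main__":
    for row in ("CTGCGGG---GGTAAT", "--GCGG-AGAGG-AA-"):
        print(row, classify_gaps(row))
```

Separating the positions matters in practice because many scoring schemes treat end gaps differently from internal ones, as in the local and semi-global variants described earlier.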
  • 41. Gap Penalties
      - In the MSA scoring scheme, a penalty is subtracted for each gap introduced into an alignment, because each gap adds uncertainty to the alignment
      - The gap penalty is used to help decide whether or not to accept a gap or insertion in an alignment
      - Biologically, it should in general be easier for a sequence to accept a different residue in a position than to have parts of the sequence chopped away or inserted. Gaps/insertions should therefore be rarer than point mutations (substitutions)
      - In general, the lower the gap penalties, the more gaps and the more identities are detected, but this should be considered in relation to biological significance
      - Most MSA programs allow for an adjustment of gap penalties
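The effect of the gap penalty on which alignment is accepted can be shown with a small scoring function. This is a sketch under assumed values: match +1, mismatch -1, and a linear penalty charged per gap character; real programs often use separate gap-opening and gap-extension penalties. The function name and the example sequences are illustrative.

```python
def alignment_score(row_a: str, row_b: str, match: int = 1,
                    mismatch: int = -1, gap_penalty: int = 2) -> int:
    """Score a pairwise alignment, subtracting gap_penalty per gapped column."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score -= gap_penalty
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


if __name__ == "__main__":
    ungapped = ("ACGTA", "CGTAC")    # five mismatches, no gaps
    gapped = ("ACGTA-", "-CGTAC")    # four matches, two gap characters
    for penalty in (2, 5):
        print(penalty,
              alignment_score(*ungapped, gap_penalty=penalty),
              alignment_score(*gapped, gap_penalty=penalty))
```

With the low penalty the gapped alignment scores higher and would be accepted; with the high penalty the ungapped alignment wins, which illustrates why the penalty should be chosen with biological plausibility in mind.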
  • 42. END