Sequence Alignment, Blast, Fasta, MSA

Sucheta Tripathy, 16th November 2012

 A protein sequence from species A
◦ Which species has the most similar protein?
◦ Where did it originate?
◦ What is its putative function?
◦ Does it contain a conserved motif, etc.?

 Blast (Basic Local Alignment Search Tool)
◦ NCBI Blast
◦ Wu-Blast
◦ PSI-Blast
 Fasta
 SSearch

 Heuristic (Educated guess)
 Does not compare the sequences over their entire length.
 Quickly locates short matches (seeds)
 Word size
 Seeds are extended in both directions
 Threshold is defined
◦ > Threshold -> keep the alignment
◦ < Threshold -> discard the alignment

 A Query sequence:
◦ Nucleotide
◦ Protein
 A Target Database
◦ Nucleotide
◦ Protein
 Blast Program
◦ Blastn
◦ Blastp
◦ tBlastx (slowest: translated nucleotide query against translated nucleotide database)
◦ tBlastn (protein query against translated nucleotide database)
◦ Blastx (translated nucleotide query against protein database)
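The choice of program follows from the query type and the database type. A minimal sketch of that lookup, where the function name `choose_program` and the `translated` flag are hypothetical and only illustrate the list above:

```python
def choose_program(query, database, translated=False):
    """Pick a BLAST program name from the query and database types.

    `translated` only matters for nucleotide vs nucleotide searches:
    it selects tblastx (both sides translated) instead of blastn.
    """
    if query == "nucleotide" and database == "nucleotide":
        return "tblastx" if translated else "blastn"
    table = {
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",   # query is translated
        ("protein", "nucleotide"): "tblastn",  # database is translated
    }
    return table[(query, database)]


print(choose_program("nucleotide", "protein"))
```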

 E value -> Expected number of hits with an equal or better
score that would occur by chance in a database of this size
 Score -> Similarity score of the alignment.
◦ Analogy: the probability of rain by chance is 0.001
◦ Passing by chance, etc.
◦ The lower the E value, the more significant the
alignment.
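A minimal sketch of how an E value is commonly computed under Karlin-Altschul statistics, E = K * m * n * exp(-lambda * S), where m is the query length, n the database length and S the raw score. The values of `lam` and `K` used below are placeholders chosen only for illustration; real values depend on the scoring matrix and gap penalties.

```python
import math


def bit_score(raw_score, lam, K):
    """Normalise a raw alignment score into a bit score."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam, K):
    """Expected number of chance hits scoring at least raw_score."""
    return K * query_len * db_len * math.exp(-lam * raw_score)


# Placeholder statistical parameters (assumption, for illustration only).
LAM, K = 0.267, 0.041

for score in (40, 60, 80):
    print(score, e_value(score, 250, 1_000_000, LAM, K))
```

A higher score gives a smaller E value, and a larger database gives a larger E value for the same score.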

 Remove low-complexity regions.
 Generate all k-mers (words) of the query.
 List all possible matching words.
- Blast keeps only high-scoring pairs
- Fasta stores all pairs irrespective of their
scores.
 Extend the matches into high-scoring
pairs (HSPs)
 Evaluate the results against the thresholds set.
 Extend the HSPs and join them together.
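The seeding and extension steps can be sketched as follows. This is a simplified illustration, not the actual Blast implementation: it uses exact word matches only, skips low-complexity filtering, and the scoring values (`match`, `mismatch`, `x_drop`) and function names are assumptions.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield (position, word) for every overlapping word of length k."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def find_seeds(query, subject, k):
    """Return (query_pos, subject_pos) pairs where a k-mer matches exactly."""
    index = defaultdict(list)
    for j, word in kmers(subject, k):
        index[word].append(j)
    return [(i, j) for i, word in kmers(query, k) for j in index.get(word, [])]


def _extend(query, subject, q, s, step, match, mismatch, x_drop):
    """Walk from (q, s) in direction step; return (best gain, length at best)."""
    gain = best = 0
    length = best_len = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        gain += match if query[q] == subject[s] else mismatch
        length += 1
        if gain > best:
            best, best_len = gain, length
        elif best - gain > x_drop:
            break  # score dropped too far below the best seen so far
        q += step
        s += step
    return best, best_len


def extend_seed(query, subject, i, j, k, match=1, mismatch=-1, x_drop=2):
    """Ungapped extension of a seed in both directions.

    Returns (score, query_start, query_end) with query_end exclusive.
    """
    right, rlen = _extend(query, subject, i + k, j + k, 1, match, mismatch, x_drop)
    left, llen = _extend(query, subject, i - 1, j - 1, -1, match, mismatch, x_drop)
    return k * match + left + right, i - llen, i + k + rlen


query, subject = "GATTACAGATT", "CCGATTACACC"
for i, j in find_seeds(query, subject, 4):
    print((i, j), extend_seed(query, subject, i, j, 4))
```

Extended seeds whose score is above the chosen threshold would be kept as HSPs; the rest would be discarded.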

ATGGGGCGAGGCAGCGGCACCTTCGAGCGTCTCCTAGACAAGGCGACCAGCCAGCTCCTGTTG
GAGACAGATTGGGAGTCCATTTTGCAGATCTGCGACCTGATCCGCCAAGGGGACACACAAGCA
AAATATGCTGTGAATTCCATCAAGAAGAAAGTCAACGACAAGAACCCACACGTCGCCTTGTATG
CCCTGGAGGTCATGGAATCTGTGGTAAAGAACTGTGGCCAGACAGTTCATGATGAGGTGGCCA
ACAAGCAGACCATGGAGGAGCTGAAGGACCTGCTGAAGAGACAAGTGGAGGTAAACGTCCGTA
ACAAGATCCTGTACCTGATCCAGGCCTGGGCGCATGCCTTCCGGAACGAGCCCAAGTACAAGG
TGGTCCAGGACACCTACCAGATCATGAAGGTGGAGGGGCACGTCTTTCCAGAATTCAAAGAGA
GCGATGCCATGTTTGCTGCCGAGAGAGCCCCAGACTGGGTGGACGCTGAGGAATGCCACCGCT
GCAGGGTGCAGTTCGGGGTGATGACCCGTAAGCACCACTGCCGGGCGTGTGGGCAGATATTCT
GTGGAAAGTGTTCTTCCAAGTACTCCACCATCCCCAAGTTTGGCATCGAGAAGGAGGTGCGCGT
GTGTGAGCCCTGCTACGAGCAGCTGAACAGGAAAGCGGAGGGAAAGGCCACTTCCACCACTGA

 Dot matrix method (bioinfx.net)
 Dynamic Programming method
◦ Global (Needleman-Wunsch method)
◦ Local (Smith-Waterman method)
 Word method or k-tuple method (heuristic)
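A minimal sketch of the dot matrix method: one sequence is laid along the rows, the other along the columns, and a mark is placed wherever the residues (or words of length `window`) are identical. The function name and the `window` parameter are assumptions made for illustration.

```python
def dot_matrix(a, b, window=1):
    """Return rows of '*' and '.' marking identical words of length window."""
    rows = []
    for i in range(len(a) - window + 1):
        row = "".join(
            "*" if a[i:i + window] == b[j:j + window] else "."
            for j in range(len(b) - window + 1)
        )
        rows.append(row)
    return rows


for row in dot_matrix("FTFTALILLAVAV", "FTALLLAAV"):
    print(row)
```

Runs of marks along a diagonal indicate regions of similarity; a larger `window` reduces background noise.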

FTFTALILLAVAV
FTALLLAAV
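A minimal sketch of global alignment by dynamic programming (Needleman-Wunsch), applied to the two sequences above. The scoring scheme (match +1, mismatch -1, linear gap -1) is an assumption for illustration; a real protein alignment would use a substitution matrix and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning a[:i] with b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("FTFTALILLAVAV", "FTALLLAAV")
print(score)
print(top)
print(bottom)
```

A local (Smith-Waterman) variant would differ mainly in flooring each cell at zero and starting the traceback from the highest-scoring cell.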

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC50453/pdf/pnas01096-

 Uses a neighbor-joining (NJ) guide tree.
◦ N sequences
 ½ * N! / (N-r)! with r = 2, i.e. N(N-1)/2 -> number of pairs
 5 sequences (5,4,3,2,1) -> 10 pairs
 (5,4), (5,3), (5,2), (5,1); (4,3), (4,2), (4,1); (3,2), (3,1); (2,1)

PAM
BLOSUM
GONNET
DNA Identity Matrix
DNA PUPY matrix

 Substitution Matrices
 Insertions and deletions are less likely than
substitutions
 Insertions and deletions in a coding DNA sequence can lead to a
frameshift.

PAM Matrices (Point Accepted Mutation Matrices)
Margaret Dayhoff, 1978

PAM1 -> Expected rates of substitution if 1% of the
amino acids have changed
BLOSUM: Blocks Substitution Matrix (the number is the % identity
threshold used to cluster the sequences in each block)

PAM matrices are based on a
simple evolutionary model
MATLFC MLTLCC

M(A/L)TL(F/C)C Two changes
Ancestral sequence?
• Only mutations are allowed
• Sites evolve independently
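Under this model, matrices for larger evolutionary distances are obtained by multiplying the PAM1 mutation probability matrix by itself (PAM250 corresponds to 250 such steps). A minimal sketch with a toy two-letter alphabet; the matrix `TOY_PAM1` is invented for illustration and is not Dayhoff's data.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(M, steps):
    """Multiply M by itself `steps` times, starting from the identity."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = mat_mul(result, M)
    return result


# Toy PAM1-like matrix: each residue is unchanged with probability 0.99.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_power(TOY_PAM1, 250)
print(toy_pam250)
```

After many steps the probability of seeing the original residue falls well below 0.99, which is why high-numbered PAM matrices suit distantly related sequences.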

Guidelines for using matrices

Protein Query Length   Matrix     Open Gap   Extend Gap
>300                   BLOSUM50   -10        -2
85-300                 BLOSUM62   -7         -1
50-85                  BLOSUM80   -16        -4
>300                   PAM250     -10        -2
85-300                 PAM120     -16        -4
35-85                  MDM40      -12        -2
<=35                   MDM20      -22        -4
<=10                   MDM10      -23        -4

PAM100 ==> BLOSUM90
PAM120 ==> BLOSUM80
PAM160 ==> BLOSUM60
PAM200 ==> BLOSUM52
PAM250 ==> BLOSUM45

Scoring Matrices
S = [s_ij] gives the score of aligning character i
with character j, for every pair i, j.

STPP
CTCA

Score = 0 + 3 + (-3) + 1 = 1
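The same sum can be reproduced with a short sketch. The dictionary `SCORES` holds only the four pair scores used in the example above (they are consistent with PAM250, which is an assumption about the slide's source); a real scoring matrix covers every residue pair.

```python
# Pair scores taken from the worked example; not a complete matrix.
SCORES = {("S", "C"): 0, ("T", "T"): 3, ("P", "C"): -3, ("P", "A"): 1}


def pair_score(x, y):
    """Look up s_ij, treating the matrix as symmetric."""
    if (x, y) in SCORES:
        return SCORES[(x, y)]
    return SCORES[(y, x)]


def alignment_score(a, b):
    """Sum the pair scores column by column for an ungapped alignment."""
    return sum(pair_score(x, y) for x, y in zip(a, b))


print(alignment_score("STPP", "CTCA"))
```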
