Sucheta Tripathy, 16th November 2012
• A protein sequence from species A
    ◦ Which species has the most similar protein?
    ◦ Where did it originate?
    ◦ What is its putative function?
    ◦ Does it contain a conserved motif?
• Blast (Basic Local Alignment Search Tool)
    ◦ NCBI Blast
    ◦ Wu-Blast
    ◦ PSI-Blast
• Fasta
• SSearch
• Heuristic (an educated guess)
• Does not compare the sequences in their entirety.
• Quickly locates short matches (seeds)
• Seed length is set by the word size
• Seeds are extended in both directions
• A score threshold is defined
    ◦ Score > threshold -> keep the alignment
    ◦ Score < threshold -> discard the alignment
GLKFA, word size 3 ->
• GLK, LKF, KFA
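The word list for GLKFA can be reproduced with a sliding window. This is a minimal sketch; the function name `kmers` is chosen here for illustration and is not part of any Blast tool.

```python
def kmers(sequence, k):
    """Return every overlapping word of length k, left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(kmers("GLKFA", 3))  # ['GLK', 'LKF', 'KFA']
```

A sequence of length L yields L - k + 1 words, so GLKFA with word size 3 gives three words.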
• A query sequence:
    ◦ Nucleotide
    ◦ Protein
• A target database:
    ◦ Nucleotide
    ◦ Protein
• Blast program
    ◦ Blastn
    ◦ Blastp
    ◦ tBlastx (slowest: translated nucleotide query against
      translated nucleotide database)
    ◦ tBlastn (protein query against translated nucleotide database)
    ◦ Blastx (translated nucleotide query against protein database)
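The program choice follows from the query type and the database type. The lookup below is only a sketch of that decision; the dictionary `PROGRAMS` and the function `candidate_programs` are hypothetical names, not part of the Blast software.

```python
# (query type, database type) -> programs that accept that combination.
PROGRAMS = {
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    ("protein", "protein"): ["blastp"],
    ("protein", "nucleotide"): ["tblastn"],
    ("nucleotide", "protein"): ["blastx"],
}


def candidate_programs(query_type, database_type):
    """Return the Blast programs that fit the given query and database types."""
    return PROGRAMS[(query_type, database_type)]


print(candidate_programs("protein", "nucleotide"))  # ['tblastn']
```

For a nucleotide query against a nucleotide database there are two candidates: blastn compares the nucleotides directly, while tblastx compares translations of both.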
• E value -> expected number of hits with an equal or
    better score that would occur by chance
• Score -> similarity score of the alignment.
    ◦ Analogy: the probability of rain by chance is 0.001
    ◦ Analogy: passing by chance, etc.
    ◦ The lower the E value, the more significant the
      alignment.
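The link between score and E value can be sketched with the Karlin-Altschul estimate E = K * m * n * exp(-lambda * S), where m and n are the query and database lengths. The default `lam` and `k` values below are illustrative assumptions (they depend on the scoring matrix and gap penalties), and real Blast uses corrected "effective" lengths, which this sketch ignores.

```python
import math


def expect_value(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Karlin-Altschul estimate: E = K * m * n * exp(-lambda * S).

    lam and k are assumed, illustrative parameters, not values taken
    from the slides.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


# A higher score gives a lower E value; a larger database gives a higher one.
for score in (30, 60, 90):
    print(score, expect_value(score, query_len=250, db_len=1_000_000))
```

The same score is therefore less significant in a larger database, which is why E values are not comparable across databases of different sizes.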
• Remove low-complexity regions
• Generate all the k-mers.
• List all possible matching words.
    - Blast keeps only the high-scoring pairs
    - Fasta stores all pairs irrespective of their
    scores.
• Extend the matches into high-scoring
    segment pairs (HSPs)
• Evaluate the results against the thresholds set.
• Extend the HSPs and join them together.
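The seed, extend and threshold steps can be sketched as follows. This is a simplified illustration, not the Blast implementation: it assumes exact word matches as seeds, a +1/-1 match/mismatch score instead of a substitution matrix, and ungapped extension that stops once the score drops `drop` below the best seen. All function names are hypothetical.

```python
def find_seeds(query, target, k):
    """Yield (query_pos, target_pos) for every exact k-mer shared by both."""
    index = {}
    for j in range(len(target) - k + 1):
        index.setdefault(target[j:j + k], []).append(j)
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            yield i, j


def extend_seed(query, target, i, j, k, match=1, mismatch=-1, drop=2):
    """Extend a seed in both directions without gaps.

    Returns (score, query_start, query_end) with a half-open query range.
    """
    def walk(qi, tj, step):
        score = best = gained = n = 0
        while 0 <= qi < len(query) and 0 <= tj < len(target):
            score += match if query[qi] == target[tj] else mismatch
            n += 1
            if score > best:
                best, gained = score, n
            elif best - score > drop:
                break  # score fell too far below the best: stop extending
            qi += step
            tj += step
        return best, gained

    right_score, right_len = walk(i + k, j + k, 1)
    left_score, left_len = walk(i - 1, j - 1, -1)
    return k * match + left_score + right_score, i - left_len, i + k + right_len


def high_scoring_pairs(query, target, k, threshold):
    """Keep only extended seeds whose score reaches the threshold."""
    hsps = set()
    for i, j in find_seeds(query, target, k):
        score, start, end = extend_seed(query, target, i, j, k)
        if score >= threshold:
            hsps.add((score, start, end, j - i))  # j - i identifies the diagonal
    return hsps


print(high_scoring_pairs("GLKFA", "TTGLKFATT", k=3, threshold=4))
```

Seeds on the same diagonal extend to the same segment, so storing the results in a set merges them into a single HSP.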
ATGGGGCGAGGCAGCGGCACCTTCGAGCGTCTCCTAGACAAGGCGACCAGCCAGCTCCTGTTG
GAGACAGATTGGGAGTCCATTTTGCAGATCTGCGACCTGATCCGCCAAGGGGACACACAAGCA
AAATATGCTGTGAATTCCATCAAGAAGAAAGTCAACGACAAGAACCCACACGTCGCCTTGTATG
CCCTGGAGGTCATGGAATCTGTGGTAAAGAACTGTGGCCAGACAGTTCATGATGAGGTGGCCA
ACAAGCAGACCATGGAGGAGCTGAAGGACCTGCTGAAGAGACAAGTGGAGGTAAACGTCCGTA
ACAAGATCCTGTACCTGATCCAGGCCTGGGCGCATGCCTTCCGGAACGAGCCCAAGTACAAGG
TGGTCCAGGACACCTACCAGATCATGAAGGTGGAGGGGCACGTCTTTCCAGAATTCAAAGAGA
GCGATGCCATGTTTGCTGCCGAGAGAGCCCCAGACTGGGTGGACGCTGAGGAATGCCACCGCT
GCAGGGTGCAGTTCGGGGTGATGACCCGTAAGCACCACTGCCGGGCGTGTGGGCAGATATTCT
GTGGAAAGTGTTCTTCCAAGTACTCCACCATCCCCAAGTTTGGCATCGAGAAGGAGGTGCGCGT
GTGTGAGCCCTGCTACGAGCAGCTGAACAGGAAAGCGGAGGGAAAGGCCACTTCCACCACTGA
• Dot matrix method (bioinfx.net)
• Dynamic programming method
    ◦ Global (Needleman-Wunsch method)
    ◦ Local (Smith-Waterman method)
• Word method or k-tuple method (heuristic)
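The dot matrix method is the simplest of the three to sketch: put one sequence on each axis and mark every cell where the two residues are identical. The function name `dot_matrix` is hypothetical, and real dot-plot tools usually add a window and a stringency filter, which this sketch leaves out.

```python
def dot_matrix(a, b):
    """Return one text row per residue of a; '*' marks identity with b."""
    return ["".join("*" if x == y else "." for y in b) for x in a]


for row in dot_matrix("GLKFA", "TTGLKFATT"):
    print(row)
```

Similar regions appear as diagonal runs of marks; a shift of the diagonal indicates an insertion or deletion.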




    FTFTALILLAVAV
    FTALLLAAV
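A global (Needleman-Wunsch) alignment score for these two sequences can be computed with dynamic programming. The sketch below assumes a very simple scheme, match +1, mismatch -1 and a linear gap penalty of -1, rather than a substitution matrix with separate gap open and extend penalties; it returns only the optimal score, not the alignment itself.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score, keeping one row of the table at a time."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: b aligned to gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first column: a aligned to gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("FTFTALILLAVAV", "FTALLLAAV"))
```

A local (Smith-Waterman) version differs mainly in clamping every cell at zero and taking the maximum over the whole table instead of the last cell.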



http://www.ncbi.nlm.nih.gov/pmc/articles/PMC50453/pdf/pnas01096-
• Uses a neighbor-joining (NJ) guide tree.
    ◦ N sequences
      ½ * N! / (N-2)! = N(N-1)/2 -> number of pairs
      5 sequences (5,4,3,2,1) -> 10 pairs
        (5,4), (5,3), (5,2), (5,1); (4,3),(4,2),(4,1);(3,2),(3,1);(2,1)
PAM
BLOSUM
GONNET
DNA Identity Matrix
DNA PUPY (purine/pyrimidine) matrix
• Substitution Matrices
    ◦ Insertions and deletions are less likely than
      substitutions.
    ◦ An insertion or deletion in a coding DNA sequence leads to a
      frameshift (unless its length is a multiple of three).



PAM Matrices (Point Accepted Mutation Matrices)
Margaret Dayhoff, 1978

PAM1 -> Expected rates of substitution if 1% of the
amino acids have changed
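Higher PAM numbers are obtained by applying the PAM1 mutation probabilities repeatedly, i.e. by raising the PAM1 matrix to a power. The sketch below shows this on a hypothetical 2-letter alphabet (`TOY_PAM1` is invented for illustration; the real matrix is 20 x 20, and the published PAM scoring matrices are log-odds values derived from these probabilities).

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(m, n):
    """Multiply m by itself n times, starting from the identity matrix."""
    result = [[1.0 if i == j else 0.0 for j in range(len(m))] for i in range(len(m))]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical toy matrix: each row sums to 1 and 1% of residues change per step.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_pow(TOY_PAM1, 250)
print(toy_pam250)
```

After 250 steps far more than 1% of sites differ, but fewer than 250%, because later changes can hit the same site again or revert it.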
BLOSUM: Blocks Substitution Matrix (the number is the % identity
used to cluster the blocks)
PAM matrices are based on a simple evolutionary model
    MATLFC          MLTLCC

          M(A/L)TL(F/C)C     Two changes
       Ancestral sequence?
• Only mutations are allowed
• Sites evolve independently
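Under these two assumptions the sequences can be compared site by site. The sketch below counts the changed sites between MATLFC and MLTLCC and rebuilds the ambiguous ancestor notation; both function names are hypothetical.

```python
def changed_sites(a, b):
    """List (position, residue_a, residue_b) for every differing site."""
    if len(a) != len(b):
        raise ValueError("only substitutions are modelled, so lengths must match")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


def ambiguous_ancestor(a, b):
    """Write identical sites as-is and differing sites as (x/y)."""
    return "".join(x if x == y else f"({x}/{y})" for x, y in zip(a, b))


print(changed_sites("MATLFC", "MLTLCC"))       # [(1, 'A', 'L'), (4, 'F', 'C')]
print(ambiguous_ancestor("MATLFC", "MLTLCC"))  # M(A/L)TL(F/C)C
```

Because insertions and deletions are excluded, a plain position-by-position comparison is enough; no alignment step is needed.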
Guidelines for using matrices


Protein Query Length   Matrix     Open Gap   Extend Gap
>300                   BLOSUM50   -10        -2
85-300                 BLOSUM62   -7         -1
50-85                  BLOSUM80   -16        -4
>300                   PAM250     -10        -2
85-300                 PAM120     -16        -4
35-85                  MDM40      -12        -2
<=35                   MDM20      -22        -4
<=10                   MDM10      -23        -4
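The BLOSUM rows of this table can be turned into a small lookup. This is a sketch only: the function name `suggest_blosum` is hypothetical, and since the table's ranges share the endpoints 85 and 300, the code assumes both belong to the 85-300 row. Lengths below 50 have no BLOSUM row in the table, so the function returns None for them.

```python
def suggest_blosum(query_length):
    """Return (matrix, open_gap, extend_gap) from the BLOSUM rows of the table."""
    if query_length > 300:
        return ("BLOSUM50", -10, -2)
    if query_length >= 85:  # assumption: 85 and 300 fall in the middle row
        return ("BLOSUM62", -7, -1)
    if query_length >= 50:
        return ("BLOSUM80", -16, -4)
    return None  # the table lists no BLOSUM matrix for shorter queries


print(suggest_blosum(120))  # ('BLOSUM62', -7, -1)
```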

PAM100   ==>    Blosum90
PAM120   ==>    Blosum80
PAM160   ==>    Blosum60
PAM200   ==>    Blosum52
PAM250   ==>    Blosum45
Scoring Matrices
S = [s_ij] gives the score of aligning character i
  with character j for every pair i, j.

    STPP
    CTCA

    s(S,C) + s(T,T) + s(P,C) + s(P,A) = 0 + 3 + (-3) + 1 = 1
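The column-by-column sum can be written as a short function. The dictionary `SCORES` below holds only the four pair scores shown in this example, not a full matrix, and the names are chosen here for illustration.

```python
# Only the four pair scores from the example; a real matrix covers all pairs.
SCORES = {("S", "C"): 0, ("T", "T"): 3, ("P", "C"): -3, ("P", "A"): 1}


def pair_score(x, y):
    """Look up s(x, y); the matrix is symmetric, so try both orders."""
    return SCORES[(x, y)] if (x, y) in SCORES else SCORES[(y, x)]


def alignment_score(a, b):
    """Sum the pair scores over the aligned columns of a and b (no gaps)."""
    return sum(pair_score(x, y) for x, y in zip(a, b))


print(alignment_score("STPP", "CTCA"))  # 1
```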

Sequence Alignment, Blast, Fasta, MSA

Editor's Notes

  • #13 Series of methods that relies on pairwise alignments