BLAST
Slides for the bioinf4biologists course

    BLAST Presentation Transcript

    • Alignment-based methods
      • Needed if we have an unknown DNA or protein sequence.
      • Purpose:
          • To find sequences/regions of significant similarity in a sequence repository or database.
          • To identify all of the homologous sequences in a database or repository.
          • To identify motifs or domains whose sequence similarity is significantly better than expected by chance.
    • Local alignment
      • Finds domains and short regions of similarity between a pair of sequences.
      • The two sequences do not need to be similar over their entire length to receive a high local similarity score.
      • This makes local similarity searches useful when looking for domains within proteins, or for regions of genomic DNA that contain introns.
    • Global alignment
      • Finds the optimal alignment over the entire length of the two sequences under comparison.
      • Not well suited to genes that have evolved by recombination or by insertion of unrelated regions of DNA; in such cases the global similarity score is greatly reduced.
      • Appropriate when the sequences are of comparable length and are homologous (descended from a common ancestor) along their entire length.
    • Terminology
      • Exact (Exhaustive):
          • This is a method of looking at all possibilities for a particular problem and then choosing the best one. It is the most rigorous method.
      • Heuristic:
          • This class of methods takes shortcuts and attempts to arrive at a good (ideally optimal) solution by making educated guesses; finding the best solution is not guaranteed.
    • Needleman-Wunsch
      • Exact global alignment method.
      • Poorly suited to database searches, to finding small regions of similarity, and to aligning sequences of vastly differing lengths.
      • The most rigorous and thorough method when the sequences have not evolved by exon shuffling, domain insertion/deletion, etc.
      • In other words, it is the best method for sequences of similar length that have evolved from a common ancestor by point processes (point mutations, small indels).
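As an illustration of exact global alignment, here is a minimal Needleman-Wunsch sketch. The scoring values (match 1, mismatch -1, linear gap penalty -2) and the function name are assumptions chosen for the example, not values from the slides; real applications would normally use a substitution matrix and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming.

    Returns (score, aligned_a, aligned_b). Scoring values are illustrative.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),  # (mis)match
                              score[i - 1][j] + gap,            # gap in b
                              score[i][j - 1] + gap)            # gap in a

    # Trace back from the bottom-right corner: the alignment must span
    # both sequences from beginning to end.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, `needleman_wunsch("ACGT", "AGT")` aligns `ACGT` with `A-GT` for a score of 1 (three matches and one gap).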
    • Smith-Waterman
      • Exact local alignment method.
      • The alignment is not required to extend along the entire length of the sequences.
      • A very good algorithm for database searching, multiple alignment and pairwise alignment.
      • It is exhaustive and can be very slow compared with the heuristics described later.
      • Unlike Needleman-Wunsch, it must consider alignments that start and end at any position, not only those that run from the beginning of both sequences to the end.
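The difference from the global method can be seen in a small Smith-Waterman sketch: cell scores are never allowed to fall below zero, and the traceback starts at the highest-scoring cell instead of the corner. The scoring values (match 2, mismatch -1, gap -2) and the function name are assumptions for illustration only.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment by dynamic programming.

    Returns (score, aligned_a, aligned_b) for one best-scoring local
    alignment. Scoring values are illustrative.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(0,                                  # start afresh
                              score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

With `smith_waterman("CCCCGATTACACCCC", "TTTGATTACATTT")` only the shared `GATTACA` region is reported; the dissimilar flanking residues do not reduce the score.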
    • FastA algorithm
      • First, regions of identity between the query and each database sequence are identified, using words of length KTUP.
      • The database sequences with the highest density of matching ‘hits’ are then re-examined.
      • The alignments are extended at either end of the matching regions, and mismatches and indels are incorporated according to a scoring matrix.
      • Each alignment then receives a score (in a simple scheme, a match scores 1, a mismatch 0 and a gap -1).
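The first step above (finding identical words and looking for a high density of hits) can be sketched as follows. This is a simplified illustration, not the FastA implementation: the function name is hypothetical, and hits are simply counted per diagonal (query position minus database position), since hits on the same diagonal belong to the same ungapped alignment.

```python
from collections import Counter, defaultdict


def ktup_diagonal_hits(query, subject, ktup=2):
    """Count identical words of length ktup shared by query and subject,
    grouped by diagonal (query position minus subject position)."""
    # Index every word of the query by its start position(s).
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)

    # Scan the subject and record the diagonal of every matching word.
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            diagonals[i - j] += 1
    return diagonals
```

A diagonal with many hits marks a candidate region; `ktup_diagonal_hits(query, subject).most_common(1)` returns the densest one, which would then be re-examined and extended.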
    • FastA algorithm
      • Is the alignment significant?
      • Could we see an alignment like this purely by chance?
      • What are the statistics involved?
    • Results of a FastA search
      • Histogram of z-scores (columns: z-score bin, observed number of sequences, expected number E()):
          z-opt   obs   E()
          < 20      0      0
            22      0      0
            24      0      0
            26      0      0
            28      0      3
            30      0     18
            32     11     70
            34     73    190
            36    430    389
            38    969    644
            40   1086    898
            42   1332   1097
            44   1252   1211
            46   1022   1233
            48   1041   1181
            50    982   1077
            52    846    947
            54    716    809
            56    650    676
            58    547    555
            60    409    449
            62    369    360
            64    289    287
            66    232    226
            68    176    178
            70    163    140
            72    124    109
            74     88     85
            76     73     66
            78     73     51
            80     44     40
            82     32     31
            84     23     24
            86     19     19
            88     15     14
            90      8     11
            92     11      9
            94      3      7
            96      2      5
            98      6      4
           100      2      3
           102      4      2
           104      3      2
           106      0      1
           108      0      1
           110      1      1
           112      0      1
           114      1      1
           116      0      0
           118      0      0
          >120      1      0
    • The best scores are:
      Sequence                                          initn init1  opt   z-sc E(13127)
      HP0793 polypeptide deformylase (def) {Escherichia    66    66  100  126.9   0.71
      AF2215 methylmalonyl-CoA mutase, subunit alpha, N    45    45   94  113.9    1.2
      AF1231 hypothetical protein                          50    50   86  104.9    4.4
      MJ1169 tungsten formylmethanofuran dehydrogenase,    45    45   85  102.7    4.8
      AF0267 hypothetical protein                          71    71   84  101.2    5.5
      AF1486 hypothetical protein                          83    83   84  102.4    6.1
      AF0262 medium-chain acyl-CoA ligase (alkK-2) {Pse    50    50   82   99.2    7.8
      AF0229 conserved hypothetical protein {Methanococ    58    58   83  103.0    8.2
      D09_orf125.gseg, 378 bases, 5AC53121 checksum.       50    50   85  110.0    8.5
      SL251_1.UVRC 1797 residues                           40    40   81   97.5    8.9
      slr2049 hypothetical protein                         83    83   83  105.5    9.9
      AF0868 alkyldihydroxyacetonephosphate synthase {C    45    45   80   97.7     12
      AF1320 GMP synthase (guaA-2) {Methanococcus janna    35    35   82  104.5     12
      SL159_1.PKSK 13344 residues                          99    74   74   79.2     12
      slr1771                                              40    40   79   95.6     13
      sll1018 dihydroorotase (pyrC)                        60    60   79   96.6     14
      slr2102 cell division protein FtsY (ftsY)            77    77   78   94.7     15
      AF0946 hypothetical protein                          67    67   76   88.8     16
      AF1325 multidrug resistance protein {Methanococcu    55    55   77   95.0     20
      SL194_2.BFMBB 1272 residues                          75    75   76   93.1     22
    • Original BLAST
      • Segment pair: a pair of subsequences of the same length that form an ungapped alignment.
      • BLAST searches for all segment pairs between the query sequence and every sequence in the database that score above a certain threshold.
      • HSP: high-scoring segment pair.
    • Original BLAST
      • HSPs are derived by first finding the pairs that satisfy the threshold (T) conditions. Then the alignment is extended in both directions until the quality of the alignment drops off dramatically or falls to zero.
      • The HSPs are then sorted according to their score.
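The extension step can be sketched as below. This is a simplified illustration under stated assumptions, not the BLAST implementation: the seed is taken to be an exact word match of length `w`, scoring is match 1 / mismatch -1, and "drops off dramatically" is modelled by a hypothetical `x_drop` parameter (extension stops once the running score falls that far below the best score seen). All function and parameter names are invented for the example.

```python
def extend_one_way(query, subject, qi, si, step, match=1, mismatch=-1, x_drop=3):
    """Extend without gaps from (qi, si) in direction step (+1 or -1).

    Returns (best_score, length_of_best_extension).
    """
    best = run = best_len = steps = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        run += match if query[qi] == subject[si] else mismatch
        steps += 1
        if run > best:
            best, best_len = run, steps   # new best: keep this extension
        elif best - run >= x_drop:
            break                         # score has dropped too far: stop
        qi += step
        si += step
    return best, best_len


def extend_hit(query, subject, qi, si, w, match=1, mismatch=-1, x_drop=3):
    """Extend an exact word hit query[qi:qi+w] == subject[si:si+w] in both
    directions. Returns (score, query_start, query_end, subject_start)."""
    right, r_len = extend_one_way(query, subject, qi + w, si + w, +1,
                                  match, mismatch, x_drop)
    left, l_len = extend_one_way(query, subject, qi - 1, si - 1, -1,
                                 match, mismatch, x_drop)
    score = w * match + left + right
    return score, qi - l_len, qi + w + r_len, si - l_len
```

Running `extend_hit` on every seed and sorting the resulting tuples by score in descending order would give the ranked list of HSPs described above.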
    • Gapped BLAST
      • The original BLAST suffered from the limitation of not being able to introduce gaps into the alignment.
      • Gapped BLAST is an effort to circumvent this shortcoming.
      • Experience shows that often several ungapped non-overlapping alignments result from a match to a single database entry.
    • Gapped BLAST
      • Intuitively, we know that it probably makes sense to generate a single alignment of the query and database sequences.
      • Gapped BLAST seeks only one of the significant ungapped alignments between the query and a database sequence, instead of all of them.
      • This speeds up the process.
    • Two-Hit Method
      • Find two word hits within a distance m of each other on the same diagonal.
      • Do not attempt any extension unless two hits meet this criterion.
      • Attempt to generate a single gapped alignment in this region.
    • How does this affect the process of searching a database?
      • The threshold for identifying hits can be lowered (finding more hits and therefore slowing the process).
      • Fewer extensions are triggered (speeds up the process).
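A minimal sketch of the two-hit criterion follows. It assumes hits are given as (query position, database position) pairs for words of length `w`, and that a second hit overlapping the first is ignored; the function name and the `max_dist` parameter (the distance m above) are hypothetical.

```python
def two_hit_triggers(hits, w, max_dist):
    """Return pairs of non-overlapping hits that lie on the same diagonal
    within max_dist of each other; only these would trigger an extension."""
    last_on_diagonal = {}   # diagonal -> most recent hit seen on it
    triggers = []
    # Scan hits in order of position along the database sequence.
    for q, s in sorted(hits, key=lambda h: (h[1], h[0])):
        diagonal = q - s
        previous = last_on_diagonal.get(diagonal)
        if previous is not None:
            distance = s - previous[1]
            if distance < w:
                continue                  # overlaps the earlier hit: ignore
            if distance <= max_dist:
                triggers.append((previous, (q, s)))
        last_on_diagonal[diagonal] = (q, s)
    return triggers
```

Isolated hits, and hits on different diagonals, never reach the extension step, which is where the saving in time comes from.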
    • PSI-BLAST
      • Position-Specific Iterative BLAST.
      • One of a family of 'profile' searches.
      • Reweights amino acids in the alignment.
      • Performs an initial BLAST search.
      • Selects those ‘hits’ that appear to be significant (above a certain threshold).
      • Uses the alignment of these sequences to identify possible 'important' residues.
    • PSI-BLAST
      • Similar sequences contain almost identical information.
      • Distant relatives contain more information (if an amino acid residue is conserved in a distant relative, then it must be important!?).
      • PSI-BLAST takes into account the similarity of the 'hits' when identifying important residues.
    • PSI-BLAST
      • Reweights those ‘important’ residues.
      • Repeats the BLAST search, this time giving an increased weight to the important residues.
      • This process can be repeated ad infinitum, although usually 2 or 3 iterations will suffice.
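The reweighting idea can be illustrated with a simple position-specific scoring matrix built from the aligned hits of one round. This is a rough sketch under several assumptions, not the PSI-BLAST procedure itself: a uniform background frequency, a fixed pseudocount, and no weighting of the hits by how similar they are to one another. The function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Build log-odds scores per alignment column from equal-length,
    already aligned sequences (gap characters are skipped)."""
    background = 1.0 / len(alphabet)          # assumed uniform background
    pssm = []
    for col in range(len(aligned[0])):
        counts = Counter(seq[col] for seq in aligned if seq[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in alphabet
        })
    return pssm


def score_with_pssm(pssm, segment):
    """Score an ungapped segment of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))
```

Residues conserved in a column receive positive scores and residues never seen there receive negative ones, so the next search round rewards matches at the conserved positions more heavily than a general scoring matrix would.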
    • PSI-BLAST
      • Advantages:
          • Identify more distant relatives.
          • Faster than more exact methods.
          • Does not require a priori knowledge of the important residues.
      • Disadvantages:
          • Can be misleading if an unrelated sequence is involved in reweighting the residues.
          • Not very reliable unless the initial BLAST search is capable of identifying homologues.
    • Significance of the similarity of two sequences
      • How can you know if two sequences show a higher degree of similarity than could be expected by chance?
          • Similarity could be due to similar base/AA biases.
          • Similarity could be due to sequence simplicity.
    • Randomisation test
      • Align the two sequences, record their score.
      • Hold one sequence in its original form and randomise the order of the residues in the other sequence, record the score.
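      • Repeat many times (e.g. 1,000).
      • For the similarity to be considered significant, the original score should be better than the scores from the randomised data; the fraction of randomised scores that match or exceed it estimates how often such a score arises by chance.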
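The randomisation test can be sketched as follows. The scoring function is passed in as a parameter; `identity_score` is only a stand-in (it counts identical positions without gaps), and in practice an alignment score would be used. The function names, the fixed random seed and the empirical p-value formula are assumptions for illustration. Because shuffling keeps the composition of the sequence, the test accounts for similarity that is due only to base or amino acid bias.

```python
import random


def identity_score(a, b):
    """Stand-in score: number of identical positions, no gaps."""
    return sum(x == y for x, y in zip(a, b))


def randomisation_test(a, b, score_fn=identity_score, n_shuffles=1000, seed=0):
    """Hold a fixed, shuffle b n_shuffles times, and compare scores.

    Returns (observed_score, empirical_p_value).
    """
    rng = random.Random(seed)
    observed = score_fn(a, b)
    residues = list(b)
    at_least_as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(residues)             # same composition, random order
        if score_fn(a, "".join(residues)) >= observed:
            at_least_as_good += 1
    # Add 1 to numerator and denominator so the estimate is never exactly 0.
    return observed, (at_least_as_good + 1) / (n_shuffles + 1)
```

A small p-value means that few or none of the shuffled sequences scored as well as the original pair.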