Blast 2013 1



  1. 1. Nuttachat Wisittipanit, PhD, School of Science, Mae Fah Luang University
  2. 2. BLAST
  3. 3. Introduction
      • Suppose you have acquired a DNA/protein sequence derived from an environmental sample, such as a lake, a pond, or a plant.
      • [Figure: cell samples → sequencing process → your sequence, e.g. KLMNTRARLIVHISGLTRK…]
  4. 4. Introduction
      • Alternatively, you might obtain a DNA/protein sequence from a database such as NCBI, EMBL, or Swiss-Prot, or find an interesting gene or sequence in a journal article.
      • [Figure: your sequence, e.g. KLMNTRARLIVHISGLTRK…]
  5. 5. Introduction
      • In either case, you might want to know whether your sequence already exists in a database, or is similar to sequences in it, possibly narrowed down to the database of a particular organism.
      • Why would you want to know that?
      • Because similar sequences let you infer structural, functional, and evolutionary relationships for your query sequence.
      • [Figure: your sequence compared against a database: already in here? Similar?]
  6. 6. Your sequence: KLMNTRARLIVHISGLTRK
      • Unknown sequence: what is this sequence? Where does it come from?
  7. 7. Introducing BLAST (Basic Local Alignment Search Tool)
      • BLAST compares a query sequence with a library or database of sequences.
      • It uses a heuristic search algorithm based on statistical methods. The algorithm was introduced by Stephen Altschul and his co-workers in 1990.
      • BLAST programs were designed for fast database searching.
  8. 8. BLAST Algorithm
  9. 9. BLAST Algorithm
  10. 10. BLAST Algorithm
  11. 11. BLAST Algorithm (Protein)
      • Query sequence (length 11): Y A N C L E H K M G S
      • Word size: W = 3
      • Sliding a window of size W along the query generates 11 – 3 + 1 = 9 words: YAN, ANC, NCL, CLE, LEH, EHK, HKM, KMG, MGS
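
The word-generation step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not BLAST's actual implementation; the function name `query_words` is made up for this example.

```python
def query_words(query: str, w: int = 3) -> list[str]:
    """Return every overlapping word of length w in the query."""
    # A query of length L yields L - w + 1 words.
    return [query[i:i + w] for i in range(len(query) - w + 1)]


# Query sequence from the slides.
words = query_words("YANCLEHKMGS", w=3)
print(len(words), words)
```

The fifth word, LEH, is the one the following slides use as their running example.
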
  12. 12. BLAST Algorithm Example
      • For each word from a window of size W = 3 (here LEH), generate neighborhood words using the BLOSUM62 matrix with a score threshold of 11.
      • All $20^3$ = 20 x 20 x 20 possible words of 3 amino acids are aligned with LEH using BLOSUM62, then sorted by score.
      • Scores at or above the threshold (kept): LEH 17, LKH 13, CEH 12, QEH 11
      • Scores below the threshold (cut off): LMH 10, DEH 9, LFH 9, LER 9, ...
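
The thresholding on this slide can be reproduced with a small sketch. It assumes only the handful of BLOSUM62 entries needed for the eight candidate words shown on the slide; a real search would score all 8000 possible words against the full matrix. The names `BLOSUM62_SUBSET`, `word_score` and `neighborhood` are illustrative.

```python
# Only the BLOSUM62 entries needed for this slide's candidates (the matrix is symmetric).
BLOSUM62_SUBSET = {
    ("L", "L"): 4, ("E", "E"): 5, ("H", "H"): 8,
    ("L", "C"): -1, ("L", "Q"): -2, ("L", "D"): -4,
    ("E", "K"): 1, ("E", "M"): -2, ("E", "F"): -3,
    ("H", "R"): 0,
}


def pair_score(a: str, b: str) -> int:
    """Look up a substitution score, trying both orientations of the pair."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def word_score(query_word: str, candidate: str) -> int:
    """Sum the position-by-position substitution scores of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(query_word, candidate))


def neighborhood(query_word: str, candidates: list[str], threshold: int) -> list[tuple[str, int]]:
    """Keep candidates scoring at or above the threshold, highest score first."""
    scored = [(c, word_score(query_word, c)) for c in candidates]
    kept = [(c, s) for c, s in scored if s >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


candidates = ["LMH", "DEH", "LEH", "CEH", "LKH", "QEH", "LFH", "LER"]
word_list = neighborhood("LEH", candidates, threshold=11)
print(word_list)
```

With a threshold of 11, only LEH, LKH, CEH and QEH survive, which is the word list carried into the next slide.
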
  13. 13. BLAST Algorithm Example
      • Word list: LEH, CEH, LKH, QEH
      • Example database sequence: DAPCQEHKRGWPNDC
      • The database sequences are scanned for exact matches of words from the word list (LEH, LKH, CEH, QEH, ...).
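
A minimal sketch of the exact-match scan, using the word list and the example database sequence from the slide. It does a plain linear scan for clarity; production BLAST uses a precomputed lookup structure instead. `find_word_hits` is a hypothetical helper name.

```python
def find_word_hits(word_list: list[str], db_seq: str) -> list[tuple[str, int]]:
    """Return (word, 0-based start position) for every exact occurrence in db_seq."""
    hits = []
    for word in word_list:
        start = db_seq.find(word)
        while start != -1:
            hits.append((word, start))
            # Keep searching after this hit so overlapping occurrences are found too.
            start = db_seq.find(word, start + 1)
    return hits


hits = find_word_hits(["LEH", "CEH", "LKH", "QEH"], "DAPCQEHKRGWPNDC")
print(hits)
```

Only QEH occurs in this database sequence, so it becomes the seed for the extension step.
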
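  14. 14. For each exact word match, the alignment is extended in both directions to find high-scoring segments. This slide extends in the right direction.
      • Query = Y A N C L E H K M G S; database sequence = D A P C Q E H K R G W P N D C; seed word = Q E H
      • Max drop-off score X = 2
      • Pair (score, accumulated score): Q-Q (5, 5), E-E (5, 10), H-H (8, 18), K-K (5, 23), M-R (-1, 22), G-G (6, 28), S-W (-3, 25)
      • At M-R the score drop is 1 <= X, so the extension continues. At S-W the score drop is 3 > X, so the extension stops and the segment is trimmed back to the maximum (28).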
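  15. 15. For each exact word match, the alignment is extended in both directions to find high-scoring segments. This slide extends in the left direction.
      • Query = Y A N C L E H K M G S; database sequence = D A P C Q E H K R G W P N D C; seed word = Q E H
      • Max drop-off score X = 2
      • Pair (score, accumulated score), moving left from the seed: H-H (8, 8), E-E (5, 13), Q-Q (5, 18), C-C (9, 27), N-P (-2, 25), A-A (4, 29), Y-D (-3, 26)
      • At N-P the score drop is 2 <= X, so the extension continues. At Y-D the score drop is 3 > X, so the extension stops and the segment is trimmed back to the maximum (29).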
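
The drop-off rule from these two slides can be sketched as follows. The sketch takes the per-position pair scores read off the slides as input instead of looking them up in BLOSUM62, and the function name `extend_with_x_drop` is an assumption made for this example.

```python
def extend_with_x_drop(seed_score: int, pair_scores: list[int], x_drop: int) -> tuple[int, int]:
    """Extend from a seed in one direction.

    Returns (best accumulated score, number of positions kept after trimming).
    Extension stops once the score falls more than x_drop below the best seen.
    """
    best = current = seed_score
    best_len = 0
    for i, score in enumerate(pair_scores, start=1):
        current += score
        if current > best:
            best, best_len = current, i
        elif best - current > x_drop:
            break  # drop exceeds X: stop; the result is already trimmed to the maximum
    return best, best_len


SEED = 18  # Q-Q + E-E + H-H = 5 + 5 + 8

# Right of the seed: K-K, M-R, G-G, S-W
right = extend_with_x_drop(SEED, [5, -1, 6, -3], x_drop=2)
# Left of the seed: C-C, N-P, A-A, Y-D
left = extend_with_x_drop(SEED, [9, -2, 4, -3], x_drop=2)

# The seed is counted in both directions, so subtract it once.
msp_score = right[0] + left[0] - SEED
print(right, left, msp_score)
```

Each direction keeps three positions beyond the seed, and combining them gives the segment pair scored on the next slide.
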
  16. 16. BLAST Algorithm Example
      • Maximal Segment Pair (MSP):
        Database: A P C Q E H K R G
        Query:    A N C Q E H K M G
        Scores:   4 -2 9 5 5 8 5 -1 6 (BLOSUM62 scoring matrix)
      • Pair score = 4 - 2 + 9 + 5 + 5 + 8 + 5 - 1 + 6 = 39
  17. 17. BLAST Algorithm Example
      • MSP from this seed: A P C Q E H K R G / A N C Q E H K M G, score 39
      • Maximal Segment Pairs (MSPs) from other seeds, with scores 42, 45, 35, 37, 51, 55, 33, are sorted together by alignment score.
      • Each match has its own E-value.
  18. 18. BLAST Algorithm: Expect Value (E-Value)
      • E-value: the number of MSPs with a similar or higher score that one can EXPECT to see by chance alone when searching a database of a particular size.
  19. 19. BLAST Algorithm: Expect Value (E-Value)
      • For example, if the E-value is 10 for a particular MSP with score S, about 10 MSPs with score >= S can be expected by chance alone (for any query sequence).
      • So our MSP is most likely not a significant match at all.
  20. 20. BLAST Algorithm: Expect Value (E-Value)
      • If the E-value is very small, e.g. $10^{-4}$ (very high score S), it is very unlikely that any MSP with score >= S would arise by chance alone.
      • Thus, our MSP is a significant match (likely homologous).
  21. 21. BLAST Algorithm: E-Value Calculation
      • First: calculate the bit score $S' = \dfrac{\lambda S - \ln K}{\ln 2}$
      • $S$ = score of the alignment (raw score)
      • $\lambda$, $K$: values depend on the scoring scheme and the sequence composition of the database. [The log is the natural logarithm (log base e).]
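
As a minimal sketch, the bit score can be computed as below, assuming the standard Karlin-Altschul form S' = (λS − ln K) / ln 2. The parameter values in the example call are made up for illustration and are not taken from the slides.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw alignment score S into a bit score S'.

    Assumes S' = (lambda * S - ln K) / ln 2, with natural logarithms.
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


# Hypothetical inputs, for illustration only.
example_bits = bit_score(raw_score=60, lam=0.3, k=0.1)
print(round(example_bits, 2))
```

Because λ and K are folded into the bit score, bit scores from searches with different scoring schemes can be compared directly, which is not true of raw scores.
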
  22. 22. BLAST Algorithm: Expect Value (E-Value)
      • The lower the E-value, the better.
      • An E-value threshold can be used to limit the number of hits shown on the results page.
  23. 23. BLAST Algorithm: E-Value Calculation
      • Second: calculate the E-value $E = m \, n \, 2^{-S'}$
      • $S'$ = bit score
      • $m$ = query length
      • $n$ = length of the database
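
A short sketch of the second step, assuming the standard form E = m · n · 2^(−S'). It also checks the algebraically equivalent raw-score form E = K · m · n · e^(−λS). All numeric inputs are hypothetical and chosen only to exercise the formula.

```python
import math


def e_value(bit_score: float, query_length: int, database_length: int) -> float:
    """Expected number of chance MSPs with at least this bit score (E = m * n * 2^-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical inputs, for illustration only.
lam, k, raw = 0.3, 0.1, 60
m, n = 200, 1_000_000

bits = (lam * raw - math.log(k)) / math.log(2)   # bit score from the raw score
e_from_bits = e_value(bits, m, n)
e_from_raw = k * m * n * math.exp(-lam * raw)     # same quantity, raw-score form
print(e_from_bits, e_from_raw)
```

Each additional bit of score halves the E-value, while doubling either the query length or the database length doubles it.
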
  24. 24. BLAST Algorithm: E-Value Interpretation
      • E-values of $10^{-4}$ and lower indicate significant homology.
      • E-values between $10^{-4}$ and $10^{-2}$ should be checked (similar domains, possibly non-homologous).
      • E-values between $10^{-2}$ and 1 do not indicate good homology.
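
These rules of thumb can be written as a small helper. The slide does not say which range the exact boundary values belong to, nor what to do above 1, so the boundary handling and the final branch below are assumptions of this sketch.

```python
def interpret_e_value(e: float) -> str:
    """Map an E-value onto the slide's rule-of-thumb categories."""
    if e <= 1e-4:
        return "significant homology"
    if e <= 1e-2:
        return "check manually (similar domains, possibly non-homologous)"
    if e <= 1:
        return "not a good homology"
    # Assumption: the slide gives no guidance for E-values above 1.
    return "outside the ranges given on the slide"


for e in (1e-30, 1e-3, 0.5, 10):
    print(e, "->", interpret_e_value(e))
```
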
  25. 25. Gapped BLAST
      • The Gapped BLAST algorithm allows gaps to be introduced into the alignments, so similar regions are not broken into several segments.
      • This method reflects biological relationships much better.
      • It results in different parameter values ($\lambda$, $K$) when calculating the E-value.
  26. 26. BLAST programs
      Name    | Description
      Blastp  | Amino acid query sequence against a protein database
      Blastn  | Nucleotide query sequence against a nucleotide sequence database
      Blastx  | Nucleotide query sequence translated in all reading frames against a protein database
      Tblastn | Protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
      Tblastx | Six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database
  27. 27. BLAST programs
      Name    | Common word size
      Blastp  | 3
      Blastn  | 11
      Blastx  | 3
      Tblastn | 3
      Tblastx | 3
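
The two program tables can be combined into a small lookup keyed by query type and database type. This is only a convenience sketch; the dictionary layout and the function name `candidate_programs` are made up for this example, and the word sizes are the common values listed on the slide.

```python
# (query type, database type) -> list of (program, common word size)
BLAST_PROGRAMS = {
    ("protein", "protein"): [("blastp", 3)],
    ("nucleotide", "nucleotide"): [("blastn", 11), ("tblastx", 3)],
    ("nucleotide", "protein"): [("blastx", 3)],
    ("protein", "nucleotide"): [("tblastn", 3)],
}


def candidate_programs(query_type: str, db_type: str) -> list[tuple[str, int]]:
    """Return the BLAST programs that accept this query/database combination."""
    return BLAST_PROGRAMS.get((query_type, db_type), [])


print(candidate_programs("nucleotide", "protein"))
print(candidate_programs("nucleotide", "nucleotide"))
```

For a nucleotide query against a nucleotide database there are two options: blastn compares the sequences directly, while tblastx compares their six-frame translations.
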
  28. 28. BLAST Suggestion
      • Where possible, use the translated sequence (protein).
      • Split a large query sequence (> 1000 for DNA, > 200 for protein) into smaller ones.
      • If the query has low-complexity regions or repeated segments, remove them and repeat the search. Example: IVLKVALRPVLRPVLRPVWQARNGS
      • Repeated segments might prevent the program from finding the ‘real’ significant matches in a database.
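
One simple way to spot a repeated segment like the one in the slide's example is a back-referencing regular expression. This is a rough sketch and not the filter BLAST itself applies; the minimum repeat-unit length of 3 and the use of X as the masking character are assumptions of this example.

```python
import re


def find_tandem_repeats(seq: str, min_unit: int = 3) -> list[tuple[str, int, int]]:
    """Return (repeat unit, 0-based start, number of copies) for each tandem repeat."""
    pattern = re.compile(r"(.{%d,}?)\1+" % min_unit)
    repeats = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        repeats.append((unit, match.start(), copies))
    return repeats


def mask_repeats(seq: str, min_unit: int = 3) -> str:
    """Replace every tandem-repeat region with X characters of the same length."""
    pattern = re.compile(r"(.{%d,}?)\1+" % min_unit)
    return pattern.sub(lambda m: "X" * len(m.group(0)), seq)


example = "IVLKVALRPVLRPVLRPVWQARNGS"
repeats = find_tandem_repeats(example)
masked = mask_repeats(example)
print(repeats)
print(masked)
```

Masking keeps the sequence length and coordinates intact, so hits found in a repeated search still map back to positions in the original query.
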
  29. 29. Running BLAST
      • Find the appropriate BLAST program
      • Enter the query sequence
      • Select the database to search
      • Run the BLAST search
      • Analyze the output
      • Interpret the E-values
  30. 30. Documenting BLAST
      • Program (Blastp, Blastn, ...)
      • Name of database
      • Word size
      • E-value threshold
      • Substitution matrix
      • Gap penalty
      • BLAST results: sequence name, bit score, raw score, E-value, identities, positives, gaps
  31. 31. BLAST 
  32. 32. Homework 4A
      • Determine the common proteins in Domestic Cat [Felis catus], Tiger [Panthera tigris] and Snow Leopard [Uncia uncia] using this initiating sequence:
        >gi|145558804
        MSMVYINMFLAFIMSLMGLLMYRSHLMSSLLCLEGMMLSLFIMMTVAILNNHFTLASMTPII
        LLVFAACEAALGLSLLVMVSNTYGTDYVQNLNLLQC
      • Report for each protein match: protein name, accession number, bit score, raw score, E-value, identities, positives and gaps.
  33. 33. Homework 4B
      • H5N1 is a subtype of the Influenza A virus and a bird-adapted strain. This subtype can cause “avian influenza” or “bird flu”, which can be fatal to humans.
      • Use the DNA sequence with GenBank accession number JX120150.1 as a seed sequence to search for TWO other matching sequences, each belonging to a different Influenza A virus subtype (HXNX). [Use Blastn]
      • Report for each subtype match: subtype name, organism origin, sequence name, accession number, bit score, raw score, E-value, identities, positives and gaps.
  34. 34. Homework 4C
      • Suppose you have acquired an unknown protein sequence:
        FLWLWPYLSYIEAVPIRKVQDDTKTLIKTIVTRINDISHTQAVSSKQRVAGLDFIP
        GLHPVLSLSRMDQTLAIYQQILTSLHSRNVVQISNDLENLRDLLHLLASSKS
      • (1) Use a BLAST program to find out which species this sequence most likely belongs to.
      • (2) Report both the scientific and the common name of the species.
      • (3) This sequence matches a certain protein of that species. Report the E-value, protein accession number [GenBank], protein name, length, full sequence and function.
  35. 35. Homework 4D
      • Calculate the E-value for an MSP with:
      • Raw score: 83
      • Query length: 103
      • Length of database: 48,109,873
      • $\lambda$: 0.316
      • $K$: 0.135
  36. 36. Send me an email with the subject “HW3_BINF_lastname_id” by 28 June before 5 pm. Late submissions will NOT be accepted.