Blast 2013 1
 

    Blast 2013 1 Presentation Transcript

    • Nuttachat Wisittipanit, PhD, School of Science, Mae Fah Luang University
    • BLAST
    • Introduction: Suppose you have acquired a DNA or protein sequence from a sample taken from an environment such as a lake, a pond or a plant. [Figure: cell samples, then the sequencing process, then your sequence: KLMNTRARLIVHISGLTRK… Img Src: http://www.austincc.edu]
    • Introduction: Or you might get a DNA or protein sequence from a database such as NCBI, EMBL or Swiss-Prot. You might also find an interesting gene or sequence in a journal. [Figure: your sequence: KLMNTRARLIVHISGLTRK…]
    • Introduction: In that case, you might want to know whether your sequence already exists in a database, or is similar to sequences there, perhaps narrowing the search down to the database of a particular organism. Why would you want to know that? Because sequence similarity lets you infer structural, functional and evolutionary relationships for your query sequence. [Figure: your sequence compared against a database: already in here? similar?]
    • Your sequence, KLMNTRARLIVHISGLTRK, is an unknown sequence. What is this sequence? Where does it come from?
    • Introducing BLAST (Basic Local Alignment Search Tool)
      • The BLAST tool is used to compare a query sequence with a library or database of sequences.
      • It uses a heuristic search algorithm based on statistical methods. The algorithm was invented by Stephen Altschul and his co-workers in 1990.
      • BLAST programs were designed for fast database searching.
    • BLAST Algorithm
    • BLAST Algorithm (Protein): The query sequence Y A N C L E H K M G S has length 11. Sliding a window of word size W = 3 along it, one position at a time, generates 11 - 3 + 1 = 9 words: YAN, ANC, NCL, CLE, LEH, EHK, HKM, KMG, MGS.
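The word-generation step can be sketched in a few lines of Python. This is only an illustration of the sliding window described on the slide; the function name `query_words` is made up for this example and is not part of any BLAST software.

```python
def query_words(query: str, w: int = 3) -> list[str]:
    """Return every overlapping word of length w, reading left to right."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


# Query sequence from the slides, word size W = 3.
words = query_words("YANCLEHKMGS", w=3)
print(len(words), words)
```

A query of length L always yields L - W + 1 words, which is where the 11 - 3 + 1 on the slide comes from.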
    • BLAST Algorithm Example: For each word from the window (W = 3), here L E H, generate neighborhood words using the BLOSUM62 matrix with score threshold = 11. All 20 x 20 x 20 = 20^3 possible words of 3 amino acids are aligned with LEH using BLOSUM62 and then sorted by score: LEH 17, LKH 13, CEH 12, QEH 11 | score threshold (cut off here) | LMH 10, DEH 9, LFH 9, LER 9, ...
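As a rough sketch of the neighborhood-word step, the snippet below scores the eight candidate words shown on the slide against LEH and keeps those at or above the threshold of 11. To stay self-contained it uses a small hand-copied subset of BLOSUM62 entries (only the pairs this example needs), and it scores just these eight candidates rather than all 20^3 words. The names `BLOSUM62_SUBSET`, `word_score` and `neighborhood` are illustrative assumptions.

```python
# Hand-copied subset of BLOSUM62: only the residue pairs needed below.
BLOSUM62_SUBSET = {
    ("L", "L"): 4, ("E", "E"): 5, ("H", "H"): 8,
    ("L", "C"): -1, ("L", "Q"): -2, ("L", "D"): -4,
    ("E", "K"): 1, ("E", "M"): -2, ("E", "F"): -3,
    ("H", "R"): 0,
}


def pair_score(a: str, b: str) -> int:
    """Look up a symmetric substitution score; KeyError if not in the subset."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def word_score(query_word: str, candidate: str) -> int:
    """Sum the position-by-position substitution scores of two words."""
    return sum(pair_score(q, c) for q, c in zip(query_word, candidate))


def neighborhood(query_word: str, candidates: list[str], threshold: int):
    """Keep candidates scoring >= threshold, highest score first."""
    scored = [(c, word_score(query_word, c)) for c in candidates]
    kept = [(c, s) for c, s in scored if s >= threshold]
    return sorted(kept, key=lambda cs: cs[1], reverse=True)


candidates = ["LMH", "DEH", "LEH", "CEH", "LKH", "QEH", "LFH", "LER"]
word_list = neighborhood("LEH", candidates, threshold=11)
print(word_list)
```

A full implementation would load the complete matrix and enumerate every possible word; the filtering and sorting logic stays the same.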
    • BLAST Algorithm Example: The word list (L E H, L K H, C E H, Q E H) is scanned against the database sequences to find exact matches of its words. In the database sequence DAPCQEHKRGWPNDC, for example, the word QEH gives an exact match.
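The exact-match scan can be sketched with a naive string search, as below. This is an assumption-laden illustration: real BLAST implementations index words in a lookup table rather than scanning each sequence per word, and `find_seed_hits` is a hypothetical helper name.

```python
def find_seed_hits(word_list: list[str], db_seq: str) -> list[tuple[str, int]]:
    """Return (word, 0-based start position) for every exact occurrence."""
    hits = []
    for word in word_list:
        start = db_seq.find(word)
        while start != -1:
            hits.append((word, start))
            # Continue searching one position further to catch overlaps.
            start = db_seq.find(word, start + 1)
    return hits


hits = find_seed_hits(["LEH", "CEH", "LKH", "QEH"], "DAPCQEHKRGWPNDC")
print(hits)
```

Each hit is a seed from which the alignment is then extended.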
    • Extension in the right direction (max drop-off score X = 2): For each exact word match, the alignment is extended in both directions to find high-scoring segments. Query = Y A N C L E H K M G S; database sequence = D A P C Q E H K R G W P N D C. The seed pairs Q-Q, E-E, H-H score 5, 5, 8, giving an accumulated score of 18. Extending to the right: K-K +5 = 23; M-R -1 = 22 (score drop = 1 <= X, continue); G-G +6 = 28; S-W -3 = 25 (score drop = 3 > X, stop). The extension is trimmed back to the maximum score, 28, so it ends at G-G.
    • Extension in the left direction (max drop-off score X = 2): For each exact word match, the alignment is extended in both directions to find high-scoring segments. Query = Y A N C L E H K M G S; database sequence = D A P C Q E H K R G W P N D C. Starting again from the seed pairs H-H, E-E, Q-Q (8, 5, 5; accumulated score 18) and extending to the left: C-C +9 = 27; N-P -2 = 25 (score drop = 2 <= X, continue); A-A +4 = 29; Y-D -3 = 26 (score drop = 3 > X, stop). The extension is trimmed back to the maximum score, 29, so it ends at A-A.
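The drop-off rule used in both directions can be sketched as a single function. The version below is a minimal illustration under the assumption that the per-pair substitution scores are already known (the lists are the values read off the two extension slides); the name `extend` and its return shape are made up for this example.

```python
def extend(seed_score: int, pair_scores: list[int], max_drop: int) -> tuple[int, int]:
    """Extend from a seed in one direction.

    Returns (best accumulated score, number of extension pairs kept).
    Extension stops once the running score falls more than max_drop
    below the best score seen so far; the result is trimmed to that best.
    """
    best = running = seed_score
    kept = 0
    for i, score in enumerate(pair_scores, start=1):
        running += score
        if running > best:
            best, kept = running, i
        elif best - running > max_drop:
            break
    return best, kept


SEED = 5 + 5 + 8  # seed pairs Q-Q, E-E, H-H as scored on the slides

right_best, right_kept = extend(SEED, [5, -1, 6, -3], max_drop=2)  # K-K, M-R, G-G, S-W
left_best, left_kept = extend(SEED, [9, -2, 4, -3], max_drop=2)    # C-C, N-P, A-A, Y-D

# The seed is counted in both directions, so subtract it once.
segment_score = left_best + right_best - SEED
print(right_best, left_best, segment_score)
```

Combining the two trimmed extensions and counting the seed only once gives the score of the whole high-scoring segment.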
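    • BLAST Algorithm Example: Maximal Segment Pair (MSP). The database segment A P C Q E H K R G is aligned with the query segment A N C Q E H K M G. Pair scores from the BLOSUM62 scoring matrix: 4, -2, 9, 5, 5, 8, 5, -1, 6. MSP score = 4 - 2 + 9 + 5 + 5 + 8 + 5 - 1 + 6 = 39.
    • BLAST Algorithm Example: The MSP above (A P C Q E H K R G / A N C Q E H K M G, score 39) is collected together with the MSPs from other seeds (scores 42, 45, 35, 37, 51, 55, 33), and all of them are sorted by alignment score. Each match has its own E-Value.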
    • BLAST Algorithm: Expect Value (E-Value)
      • E-Value: The number of MSPs with a similar or higher score that one can EXPECT to see by chance alone when searching a database of a particular size.
    • BLAST Algorithm: Expect Value (E-Value)
      • For example, if the E-Value equals 10 for a particular MSP with score S, about 10 MSPs with score >= S can be expected to occur by chance alone (for any query sequence).
      • So our MSP is most likely not a significant match at all.
    • BLAST Algorithm: Expect Value (E-Value)
      • If the E-Value is very small, e.g. 10^-4 (which corresponds to a very high score S), it is very unlikely that an MSP with score >= S would occur by chance.
      • Thus, our MSP is a significant match (likely homologous).
    • BLAST Algorithm: E-Value Calculation
      • First: Calculate the bit score, $S' = (\lambda S - \ln K) / \ln 2$
      • $S$ = score of the alignment (raw score)
      • The values of $\lambda$ and $K$ depend on the scoring scheme and the sequence composition of the database. [The log is the natural logarithm (log base e).]
    • BLAST Algorithm: Expect Value (E-Value)
      • The lower the E-Value, the better.
      • The E-Value can be used to limit the number of hits on the results page.
    • BLAST Algorithm: E-Value Calculation
      • Second: Calculate the E-Value, $E = m \, n \, 2^{-S'}$
      • $S'$ = bit score
      • $m$ = query length
      • $n$ = length of database
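The two-step calculation can be sketched in Python, assuming the standard Karlin-Altschul forms: bit score S' = (λS - ln K) / ln 2, then E = m · n · 2^(-S'). The parameter values below (λ = 0.318, K = 0.134, a database of 1,000,000 residues) are illustrative assumptions, not values from the slides; the raw score 39 and query length 11 are borrowed from the worked MSP example.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw alignment score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits: float, m: int, n: int) -> float:
    """Expected number of chance MSPs: E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


# Illustrative (assumed) statistical parameters and search-space sizes.
lam, k = 0.318, 0.134
bits = bit_score(39, lam, k)
expect = e_value(bits, m=11, n=1_000_000)
print(f"bit score = {bits:.2f}, E-Value = {expect:.3g}")
```

Substituting the first formula into the second gives the equivalent single-step form E = K · m · n · e^(-λS), which is a handy cross-check.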
    • BLAST Algorithm: E-Value Interpretation
      • E-values of 10^-4 and lower indicate a significant homology.
      • E-values between 10^-4 and 10^-2 should be checked (similar domains, maybe non-homologous).
      • E-values between 10^-2 and 1 do not indicate a good homology.
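These rules of thumb can be written as a small lookup, sketched below. How the boundary values (exactly 10^-4 or 10^-2) are assigned, and what to return above 1, are assumptions made for this example; the slide does not specify them, and `interpret_e_value` is a hypothetical helper name.

```python
def interpret_e_value(e: float) -> str:
    """Map an E-value to the slide's rule-of-thumb category."""
    if e <= 1e-4:
        return "significant homology"
    if e <= 1e-2:
        return "check manually (similar domains, maybe non-homologous)"
    if e <= 1:
        return "no good homology"
    return "outside the ranges given on the slide"


for e in (1e-30, 5e-3, 0.5, 10):
    print(e, "->", interpret_e_value(e))
```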
    • Gapped BLAST
      • The Gapped BLAST algorithm allows gaps to be introduced into the alignments, so that similar regions are not broken into several segments.
      • This method reflects biological relationships much better.
      • It results in different parameter values ($\lambda$, $K$) when calculating the E-Value.
    • BLAST programs (Name: Description)
      • Blastp: Amino acid query sequence against a protein database
      • Blastn: Nucleotide query sequence against a nucleotide sequence database
      • Blastx: Nucleotide query sequence translated in all reading frames against a protein database
      • Tblastn: Protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
      • Tblastx: Six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database
    • BLAST programs (Name: Common Word Size)
      • Blastp: 3
      • Blastn: 11
      • Blastx: 3
      • Tblastn: 3
      • Tblastx: 3
    • BLAST Suggestions
      • Where possible, use the translated sequence (protein).
      • Split a large query sequence (> 1000 for DNA, > 200 for protein) into smaller ones.
      • If the query has low-complexity regions or repeated segments, remove them and repeat the search. Example: IVLKVALRPVLRPVLRPVWQARNGS. Repeated segments might keep the program from finding the 'real' significant matches in a database.
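The splitting suggestion can be sketched as simple fixed-size chunking using the length limits from the slide. Non-overlapping chunks are an assumption made here for brevity; a match that spans a chunk boundary would be cut in two, so overlapping chunks may be preferable in practice. The names `LIMITS` and `split_if_large` are illustrative.

```python
# Length limits taken from the slide: 1000 for DNA, 200 for protein.
LIMITS = {"dna": 1000, "protein": 200}


def split_if_large(seq: str, kind: str) -> list[str]:
    """Return [seq] unchanged if short enough, else fixed-size chunks."""
    limit = LIMITS[kind]
    if len(seq) <= limit:
        return [seq]
    return [seq[i:i + limit] for i in range(0, len(seq), limit)]


# A made-up 450-residue protein query, used only to show the chunk sizes.
chunks = split_if_large("M" * 450, "protein")
print([len(c) for c in chunks])
```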
    • Running BLAST
      • Find the appropriate BLAST program
      • Enter the query sequence
      • Select the database to search
      • Run the BLAST search
      • Analyze the output
      • Interpret the E-values
    • Documenting BLAST
      • Program (Blastp, Blastn, ...)
      • Name of database
      • Word size
      • E-Value threshold
      • Substitution matrix
      • Gap penalty
      • BLAST results: Sequence Name, Bit Score, Raw Score, E-Value, Identities, Positives, Gaps
    • BLAST: http://blast.ncbi.nlm.nih.gov/
    • Homework 4A
      • Determine the common proteins in Domestic Cat [Felis catus], Tiger [Panthera tigris] and Snow Leopard [Uncia uncia] using this initiating sequence: >gi|145558804 MSMVYINMFLAFIMSLMGLLMYRSHLMSSLLCLEGMMLSLFIMMTVAILNNHFTLASMTPIILLVFAACEAALGLSLLVMVSNTYGTDYVQNLNLLQC
      • Report for each protein match: Protein name, accession number, bit score, raw score, E-Value, Identities, Positives and Gaps.
    • Homework 4B
      • H5N1 is a subtype of the Influenza A virus and is a bird-adapted strain. This subtype can cause "avian influenza" or "bird flu", which can be fatal to humans.
      • Use the DNA sequence with GenBank accession number JX120150.1 as a seed sequence to search for TWO other matching sequences, each belonging to a different Influenza A virus subtype (HXNX). [Use Blastn]
      • Report for each subtype match: Subtype name, Organism origin, Sequence name, accession number, bit score, raw score, E-Value, Identities, Positives and Gaps.
    • Homework 4C
      • Suppose you have acquired an unknown protein sequence: FLWLWPYLSYIEAVPIRKVQDDTKTLIKTIVTRINDISHTQAVSSKQRVAGLDFIPGLHPVLSLSRMDQTLAIYQQILTSLHSRNVVQISNDLENLRDLLHLLASSKS
      • (1) Use a BLAST program to find out which species this sequence most likely belongs to.
      • (2) Report both the scientific and the common name of the species.
      • (3) This sequence matches a certain protein of that species. Report the E-Value, protein accession number [GenBank], protein name, length, full sequence and function.
    • Homework 4D
      • Calculate the E-Value for an MSP with:
      • Raw Score: 83
      • Query Length: 103
      • Length of database: 48,109,873
      • $\lambda$: 0.316
      • $K$: 0.135
    • Send me an email with the subject “HW3_BINF_lastname_id” by 28 June, before 5 pm. Late submissions will NOT be accepted.