Blast bioinformatics


Published in: Technology

  1. 1. BLAST 06/01/2012
  2. 2. Introduction: BLAST is an acronym for Basic Local Alignment Search Tool. The BLAST program was developed by Stephen Altschul et al. at NCBI in 1990. Like FASTA, it is a heuristic method. It is one of the most popular programs for sequence analysis.
  3. 3. BLAST enables a researcher to compare a query sequence with a library or database of sequences and to identify library sequences that resemble the query sequence above a certain threshold. The objective is to find high-scoring ungapped segments among related sequences.
  4. 4. Using BLAST: (1) Select the BLAST program to use (blastn, blastp, blastx, tblastn, tblastx). (2) Select the database to search (different BLAST programs have different databases). (3) Enter the query sequence. (4) Submit the search.
  5. 5. Steps in BLAST: The sequence is optionally filtered to remove low-complexity regions (AGAGAG…). The next step is to create a list of words from the query sequence. Each word is typically 3 residues for protein sequences and 11 residues for DNA sequences. The list includes every possible word extracted from the query sequence. This step is also called seeding.
  6. 6. PROTEIN WORDS. Query: GTQITVEDLFYNIATRRKALKN. Word size = 3 (word size can be 2 or 3; default = 3). Words: GTQ, TQI, QIT, ITV, TVE, VED, EDL, DLF, ... Make a lookup table of words. Neighborhood words: LTV, MTV, ISV, LSV, etc.
  7. 7. NUCLEOTIDE WORDS. Query: GTACTGGACATGGACCCTACAGGAA. Word size = 11 (minimum word size = 7; blastn default = 11; megablast default = 28). Words: GTACTGGACAT, TACTGGACATG, ACTGGACATGG, CTGGACATGGA, TGGACATGGAC, GGACATGGACC, GACATGGACCC, ACATGGACCCT, ... Make a lookup table of words.
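
The seeding step from the slides above can be sketched in a few lines of Python. This is only an illustration, not the code BLAST itself uses; the function names `extract_words` and `build_lookup` are hypothetical. It slides a window of the chosen word size along the query and records where each word starts.

```python
from collections import defaultdict


def extract_words(query: str, word_size: int) -> list[str]:
    """Return every overlapping word of length word_size in the query."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


def build_lookup(query: str, word_size: int) -> dict[str, list[int]]:
    """Map each word to the list of 0-based query positions where it starts."""
    lookup: dict[str, list[int]] = defaultdict(list)
    for position, word in enumerate(extract_words(query, word_size)):
        lookup[word].append(position)
    return dict(lookup)


# Queries and word sizes taken from the two slides above.
protein_words = extract_words("GTQITVEDLFYNIATRRKALKN", 3)
dna_words = extract_words("GTACTGGACATGGACCCTACAGGAA", 11)
protein_lookup = build_lookup("GTQITVEDLFYNIATRRKALKN", 3)

print(protein_words[:8])
print(dna_words[:4])
```

A dictionary keyed by word is one simple way to make the later database scan cheap: each database word can be checked against the query with a single lookup.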
  8. 8. The third step is to search a sequence database for occurrences of these words. This step identifies the database sequences that contain matching words.
  9. 9. Using a substitution score matrix, the query sequence words are evaluated for matches with any database sequence, and the (log-odds) scores of the matched residues are added. A cut-off score (T) is selected to reduce the number of matches to the most significant ones. The above procedure is repeated for each word in the query sequence. The remaining high-scoring words are organised into an efficient search tree and rapidly compared to the database sequences.
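
The word scoring against a cut-off T described above might look roughly like the following sketch. The pair scores here are a small hypothetical table over six residues, chosen only for illustration; a real search would use a full substitution matrix such as BLOSUM62. The names `PAIR_SCORES`, `pair_score`, and `neighborhood` are assumptions made for this example.

```python
from itertools import product

# Hypothetical, illustrative pair scores over six residues (symmetric).
# Keys are the two residues in alphabetical order.
PAIR_SCORES = {
    "II": 4, "IL": 2, "IM": 1, "IS": -2, "IT": -1, "IV": 3,
    "LL": 4, "LM": 2, "LS": -2, "LT": -1, "LV": 1,
    "MM": 5, "MS": -1, "MT": -1, "MV": 1,
    "SS": 4, "ST": 1, "SV": -2,
    "TT": 5, "TV": 0,
    "VV": 4,
}


def pair_score(a: str, b: str) -> int:
    """Score one aligned residue pair, ignoring the order of a and b."""
    return PAIR_SCORES["".join(sorted(a + b))]


def neighborhood(word: str, threshold: int, alphabet: str = "ILMSTV") -> list[str]:
    """Return all words whose summed pair score against `word` is at least T."""
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(pair_score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return sorted(hits)


neighbors = neighborhood("ITV", threshold=11)
print(neighbors)
```

Raising T shrinks the neighborhood toward the exact word only, while lowering it admits more distant words at the cost of more chance matches.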
  10. 10. If a good match is found, an alignment is extended from the match area in both directions for as long as the score continues to grow. The extension continues until the score of the alignment drops below a threshold due to mismatches (the drop threshold is twenty-two for proteins and twenty for DNA).
  11. 11. The resulting contiguous aligned segment pair without gaps is called a high-scoring segment pair (HSP). In the original version of BLAST, the highest-scoring HSPs are presented as the final report.
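
As a rough sketch of the ungapped extension that produces an HSP, the code below extends a seed match to the left and right and stops in each direction once the running score has fallen more than `drop` below the best score seen so far. The match and mismatch scores of +1 and -1, the small drop value, and the function names are assumptions for this toy example; they are not BLAST's actual parameters.

```python
def extend_one_direction(query, subject, q_start, s_start, step, drop,
                         match=1, mismatch=-1):
    """Extend without gaps in one direction (step = +1 or -1).

    Returns (best_score, best_length): the highest running score reached
    and the number of aligned positions needed to reach it.
    """
    score = best = 0
    length = best_length = 0
    q, s = q_start, s_start
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        length += 1
        if score > best:
            best, best_length = score, length
        elif best - score > drop:
            break  # score fell too far below the best seen: stop extending
        q += step
        s += step
    return best, best_length


def extend_seed(query, subject, q_pos, s_pos, word_size, drop):
    """Extend an exact seed match of word_size residues in both directions.

    Returns (score, query_begin, query_end, subject_begin), where the HSP
    covers query[query_begin:query_end].
    """
    seed_score = word_size  # every seed position is a match scoring +1
    right, right_len = extend_one_direction(
        query, subject, q_pos + word_size, s_pos + word_size, +1, drop)
    left, left_len = extend_one_direction(
        query, subject, q_pos - 1, s_pos - 1, -1, drop)
    return (seed_score + left + right,
            q_pos - left_len,
            q_pos + word_size + right_len,
            s_pos - left_len)


query = "AAGTACGTCC"
subject = "TTGTACGTGG"
hsp = extend_seed(query, subject, q_pos=3, s_pos=3, word_size=4, drop=2)
print(hsp, query[hsp[1]:hsp[2]])
```

Tracking the best score separately from the running score lets the extension step over an isolated mismatch while still trimming the reported HSP back to its highest-scoring extent.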
  12. 12. A later improvement in the implementation of BLAST is the ability to provide gapped alignments. In gapped BLAST, the highest-scoring segment is chosen to be extended in both directions using dynamic programming, where gaps may be introduced. The extension continues while the alignment score stays above a certain threshold; otherwise it is terminated.
  13. 13. BLAST Output: (1) an introduction that tells where the search occurred and what database and query were compared; (2) a list of the sequences in the database containing segment pairs whose scores were least likely to occur by chance; (3) alignments of the high-scoring segment pairs showing identical and similar residues; (4) a complete list of the parameter settings used for the search.
  14. 14. BLAST Variants (program: query sequence / database sequence). BLASTP: protein / protein. BLASTN: nucleic acid / nucleic acid. BLASTX: translated nucleic acid / protein. TBLASTN: protein / translated nucleic acid. TBLASTX: translated nucleic acid / translated nucleic acid.
  15. 15. Databases available on the BLAST Web server (database: description). A. Peptide sequence databases: 1. nr: translations of GenBank DNA sequences with redundancies removed, PDB, SwissProt, PIR, and PRF; 2. month: new or revised entries or updates to nr in the previous 30 days; 3. swissprot: latest release of the SwissProt protein sequence database; 4. Drosophila genome: provided by Celera and the Berkeley Drosophila Genome Project; 5. yeast: yeast (Saccharomyces cerevisiae) genomic sequences; 6. E. coli: E. coli genomic sequences; 7. pdb: sequences of proteins of known three-dimensional structure from the Brookhaven Protein Data Bank; 8. yeast: yeast (S. cerevisiae) protein sequences; 9. E. coli: E. coli genomic coding sequence translations; 10. kabat [kabatpro]: Kabat's database of sequences of immunological interest; 11. alu: translations of select Alu repeats from REPBASE, a database of sequence repeats.
  16. 16. B. Nucleotide sequence databases: 1. nr: GenBank, EMBL, DDBJ, and PDB sequences with redundancies removed (EST, STS, GSS, and HTGS sequences excluded); 2. month: new or revised entries or updates to nr in the previous 30 days; 3. dbest: EST sequences from GenBank, EMBL, and DDBJ with redundancies removed; 4. dbsts: STS sequences from GenBank, EMBL, and DDBJ with redundancies removed; 5. htgs: high-throughput genomic sequences; 6. kabat [kabatnuc]: Kabat's database of sequences of immunological interest; 7. vector: vector subset of GenBank; 8. mito: database of mitochondrial sequences; 9. alu: select Alu repeats from REPBASE, a database of sequence repeats, suitable for masking Alu repeats from query sequences; 10. epd: eukaryotic promoter database; 11. gss: genome survey sequences, including single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.
  17. 17. Difference between BLAST and FASTA. Word matching: BLAST uses a substitution matrix to find matching words; FASTA uses the hashing procedure. Word size: BLAST uses protein = 3, DNA = 11; FASTA uses k-tuple protein = 2, DNA = 4-6. Speed: BLAST is faster than FASTA. Specificity: BLAST has higher specificity than FASTA due to low-complexity masking; FASTA has lower specificity.
  18. 18. E-value (expectation value): An important statistical indicator in sequence alignment. It estimates how many alignments with an equal or better score a database search would be expected to produce by random chance. The E-value therefore provides information about the likelihood that a given sequence match is purely due to chance. The lower the E-value, the less likely the database match is a result of random chance, and therefore the more significant the match is.
  19. 19. Formula: The E-value is determined by the equation E = m × n × P, where m is the total number of residues in the database, n is the number of residues in the query sequence, and P is the probability that an HSP alignment is a result of random chance.
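
A short worked example may help make the formula E = m × n × P concrete. The database size, query length, and probability below are made-up numbers chosen only for the arithmetic, and `e_value` is a hypothetical helper name.

```python
def e_value(m: int, n: int, p: float) -> float:
    """E = m * n * P, following the formula on the slide above."""
    return m * n * p


# Hypothetical inputs: a 1,000,000-residue database, a 250-residue query,
# and a chance probability of 1e-12 for the HSP.
small_db = e_value(m=1_000_000, n=250, p=1e-12)

# The same HSP searched against a database ten times larger.
large_db = e_value(m=10_000_000, n=250, p=1e-12)

print(small_db, large_db)
```

With P held fixed, the E-value grows in proportion to the database size, which is why the same alignment can look less significant when searched against a larger database.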
  20. 20. Bit Score: A bit score is another prominent statistical indicator used in addition to the E-value in BLAST output. The bit score measures sequence similarity independently of query sequence length and database size, and is normalized based on the raw pairwise alignment score.
  21. 21. Formula: The bit score (S) is determined by the following formula: S = (λ × s − ln K) / ln 2, where λ is the Gumbel distribution constant, s is the raw alignment score, and K is a constant associated with the scoring matrix used. Thus, the bit score (S) is linearly related to the raw alignment score (s). Hence, the higher the bit score, the more significant the match is.
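
The bit score formula can be checked numerically with a small sketch. The values of λ and K used here are illustrative assumptions; in practice they depend on the scoring matrix and gap penalties in use. The function name `bit_score` is hypothetical.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """S = (lambda * s - ln K) / ln 2, following the formula on the slide above."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Illustrative constants (assumed for this example only).
LAMBDA = 0.267
K = 0.041

score_100 = bit_score(100, LAMBDA, K)
score_101 = bit_score(101, LAMBDA, K)

# Linearity: each extra unit of raw score adds lambda / ln 2 bits.
bits_per_raw_unit = score_101 - score_100
print(round(score_100, 2), round(bits_per_raw_unit, 4))
```

Because the relationship is linear, ranking hits by bit score gives the same order as ranking them by raw score under a single scoring system, while the normalization makes scores comparable across different scoring systems.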