The figure of ~38,500,000 protein sequences is from searching NCBI Protein for 1:10000000000000000000000[SLEN] on 18-Feb-2011, which returned 38,535,878 matching protein sequences. Image credit (filing cabinet): http://etc.usf.edu/clipart/13000/13089/file_cabinet_13089_lg.gif
BLAST
Dr Avril Coghlan
firstname.lastname@example.org
Note: this talk contains animations which can only be seen by downloading and using ‘View Slide show’ in PowerPoint
• Sequence alignment has many uses
  Sequence assembly: genome sequences are assembled by using sequence alignment methods to find overlaps between many short pieces of DNA
  Gene finding: alignment of whole genome sequences from two or more species can aid in discovery of previously unknown genes
  Sequence divergence: the amount of sequence similarity between sequences (which can be calculated from a sequence alignment) tells us how closely they are related
  Database searching: we use fast sequence alignment methods (eg. BLAST) to determine whether a protein/DNA sequence is similar to any known sequence
  Prediction of function: if we know the function of a sequence, we can predict the function of similar sequences identified by database searching (eg. for fruitfly eyeless gene)
BLAST
• The number of DNA and protein sequences in public databases is very large
  NCBI Protein database has ~38,500,000 protein sequences
• Searching a database involves aligning the query sequence to each sequence in the database, to find significant local alignments
  [Figure: query sequence A (eg. predicted protein from a candidate gene (ORF)) is aligned to each database sequence B in turn]
BLAST
• Needleman-Wunsch & Smith-Waterman are too slow for searching databases
• Fast ‘heuristic’ methods are used, eg. BLAST
  N.B. ‘heuristic’ means they’re not guaranteed to find the best solution (best alignment here), but they work well in practice
• BLAST was developed by Stephen Altschul & colleagues at NCBI in 1990
  NCBI = National Center for Biotechnology Information (USA)
  BLAST = ‘Basic Local Alignment Search Tool’
• The most used bioinformatics program
  Altschul’s 1997 paper on BLAST has been cited >26,000 times!
There are two main steps in BLAST
1 It makes a list of words of length k (eg. k = 3 amino acids) in the query sequence
  It then looks for database sequences that share these words
  Database sequences that share many words with the query are used for the final alignments (step 2)
  Query sequence: ADSKLWLLFKSLMNDKPFKKADFF
  3-amino-acid words: ADS DSK SKL ...
  Database sequence 1: HIRTHIQLEQEWDSALIAAIQLE (doesn’t share words)
  Database sequence 2: PDADSTESKLAKAIQLFVCTTILCYT (shares ADS, SKL words)
  etc.
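A minimal Python sketch of step 1, using the query and the two database sequences from this slide. It only counts exact shared words, as the slide describes; real BLAST also accepts similar (not just identical) words, which this sketch leaves out. The function name `words` is chosen here for illustration.

```python
def words(seq, k=3):
    """Return the set of all overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


# Sequences taken from the slide
query = "ADSKLWLLFKSLMNDKPFKKADFF"
database = {
    "sequence 1": "HIRTHIQLEQEWDSALIAAIQLE",
    "sequence 2": "PDADSTESKLAKAIQLFVCTTILCYT",
}

query_words = words(query)

# For each database sequence, list the words it shares with the query
shared = {
    name: sorted(query_words & words(seq))
    for name, seq in database.items()
}

for name, hits in shared.items():
    print(name, "shares", len(hits), "words:", hits)
```

Database sequence 1 shares no words with the query, while database sequence 2 shares ADS and SKL, so only sequence 2 would go forward to step 2.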
2 For a database sequence that shares many words with the query, it makes an alignment
  A local alignment of the query & the database sequence
  The alignment contains the initial region with shared words
  However, the alignment may extend beyond that initial region
• BLAST finds islands of similarity between sequences
  Given two sequences A and B, BLAST makes local alignments of pairs of subsequences of A and B
  [Figure: sequences A and B joined by three separate local alignments (alignment 1, alignment 2, alignment 3)]
• BLAST reports local alignments between the query sequence A and a database sequence B
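The idea of growing an alignment outwards from a shared word can be sketched as follows. This is a simplified illustration, not BLAST's actual procedure: it assumes no gaps, a made-up scoring of +1 per match and -1 per mismatch (BLAST uses substitution matrices), and a hypothetical `x_drop` cut-off that stops the extension once the score falls too far below the best seen. The two toy sequences are invented for the example.

```python
def extend_seed(query, subject, q_start, s_start, k=3,
                match=1, mismatch=-1, x_drop=3):
    """Extend an exact k-letter shared word in both directions, without gaps.

    Returns (score, q_begin, q_end, s_begin, s_end); the end positions
    are exclusive, so query[q_begin:q_end] is the aligned region.
    """
    def extend(step, qi, si):
        # Walk in one direction, remembering the best score and how far
        # we had to go to reach it.
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            score += match if query[qi] == subject[si] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break  # score has dropped too far: give up extending
            qi += step
            si += step
        return best, best_len

    right_score, right_len = extend(+1, q_start + k, s_start + k)
    left_score, left_len = extend(-1, q_start - 1, s_start - 1)
    score = k * match + left_score + right_score
    return (score,
            q_start - left_len, q_start + k + right_len,
            s_start - left_len, s_start + k + right_len)


# Toy sequences (hypothetical); the shared word DEF starts at
# position 4 in the query and position 5 in the database sequence.
query = "TTACDEFGHKK"
subject = "PPPACDEFGHWW"
score, qb, qe, sb, se = extend_seed(query, subject, 4, 5)
print(score, query[qb:qe], subject[sb:se])
```

Here the 3-letter word DEF grows into the 7-letter local alignment ACDEFGH, showing how the final alignment can extend beyond the initial region of shared words.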
• You can use BLAST to search many sequence databases (eg. NCBI or UniProt) via websites
• Compares a DNA/protein query sequence to a sequence database and calculates the statistical significance (P-value) of matches
• Website for searching GenBank and other NCBI sequence databases: http://www.ncbi.nlm.nih.gov/BLAST
  Can be used to search the NCBI Nucleotide database (DNA sequences), as well as the NCBI Protein database
• Four commonly used types of BLAST search:
  BLASTP: searches a protein database with a protein query
  BLASTN: searches DNA/RNA database with DNA/RNA query
  BLASTX: searches a protein database with DNA/RNA query
  TBLASTN: searches DNA/RNA database with protein query
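The choice between the search types listed on this slide depends only on the type of the query and the type of the database, which can be written as a small lookup table. The function name and the labels "protein" and "nucleotide" are chosen here for illustration and are not part of any BLAST interface.

```python
# (query type, database type) -> BLAST search type, as listed on the slide
BLAST_PROGRAMS = {
    ("protein", "protein"): "BLASTP",
    ("nucleotide", "nucleotide"): "BLASTN",
    ("nucleotide", "protein"): "BLASTX",
    ("protein", "nucleotide"): "TBLASTN",
}


def choose_blast_program(query_type, database_type):
    """Pick the BLAST search type for a given query and database type."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"unknown combination: {query_type!r} query, "
            f"{database_type!r} database"
        )


# eg. a protein query (such as fruitfly Eyeless) against a protein database
print(choose_blast_program("protein", "protein"))
```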
FASTA format
• Many programs for sequence analysis/alignment (eg. CLUSTAL) expect the input sequences to be in FASTA format
  Each sequence is preceded by a header line that starts with “>” followed by the sequence identifier

>fruitfly
MFTLQPTPTAIGTVVPPWSAGTLIERLPSLEDMAHKDNVIAMRNLPCLGTAGGSGLGGIAGKPSPTMEAVEASTASHPHSTSSYFATTYYHLTDDECHSGVNQLGGVFVGGR
PLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQENVCTNDNIPSVSSINRVLRNLA
AQKEQQSTGSGSSSTSAGNSISAKVSVSIGGNVSNVASGSRGTLSSSTDLMQTATPLNSSESGGASNSGEGSEQEAIYEKLRLLNTQHAAGPGPLEPARAAPLVGQSPNHLG
TRSSHPQLVHGNHQALQQHQQQSWPPRHYSGSWYPTSLSEIPISSAPNIASVTAYASGPSLAHSLSPPNDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGENS
NGGASNIGNTEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYPDVFARERLAGKIGLPEARIQVWFSNRRAKWRREEKLRNQRRTPNSTGASATSSSTSATASLTDS
PNSLSACSSLLSGSAGGPSVSTINGLSSPSTLSTNVNAPTLGAGIDSSESPTPIPHIRPSCTSDNDNGRQSEDCRRVCSPCPLGVGGHQNTHHIQSNGHAQGHALVPAISPR
LNFNSGSFGAMYSNMHHTALSMSDSYGAVTPIPSFNHSAVGPLAPPSPIPQQGDLTPSSLYPCHMTLRPPPMAPAHHHIVPGDGGRPAGVGLGSGQSANLGASCSGSGYEVL
SAYALPPPPMASSSAADSSFSAASSASANVTPHHTIAQESCPSPCSSASHFGVAHSSGFSSDPISPAVSSYAHMSYNYASSANTMTPSSASGTSAHVAPGKQQFFASCFYSP
WV
>human
MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVC
TNDNIPSVSSINRVLRNLASEKQQMGADGMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPTQDGCQQQEGGGENTNSISSNGEDSDEAQMRLQLKRKLQRNRTSFTQEQIEA
LEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASNTPSHIPISSSFSTSVYQPIPQPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFT
MANNLPMQPPVPSQTSSYSCMLPTSPSVNGRSYDTYTPPHMQTHMNSQPMGTSGTTSTGLISPGVSVPVQVPGSEPDMSQYWPRLQ
>mouse
MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVC
TNDNIPSVSSINRVLRNLASEKQQMGADGMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPTQDGCQQQEGGGENTNSISSNGEDSDEAQMRLQLKRKLQRNRTSFTQEQIEA
LEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASNTPSHIPISSSFSTSVYQPIPQPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFT
MANNLPMQPPVPSQTSSYSCMLPTSPSVNGRSYDTYTPPHMQTHMNSQPMGTSGTTSTGLISPGVSVPVQVPGSEPDMSQYWPRLQ
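To make the FASTA layout concrete, here is a small Python sketch of a parser: a line starting with “>” begins a new record, and all following lines up to the next header are joined into one sequence. The function name `parse_fasta` is chosen for illustration, and the example sequences are shortened to the first few residues of the fruitfly and human sequences on the slide, split over two lines to mimic line wrapping.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {identifier: sequence}."""
    chunks = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the identifier is the first word after ">"
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks[name] = []
        elif name is not None:
            # Sequence line: belongs to the most recent header
            chunks[name].append(line)
    return {name: "".join(lines) for name, lines in chunks.items()}


# Shortened example records (truncated for illustration only)
example = """>fruitfly
MFTLQP
TPTAIG
>human
MQNSHS
GVNQLG
"""

sequences = parse_fasta(example)
for name, seq in sequences.items():
    print(name, len(seq), seq)
```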
• You can use BLAST to search many sequence databases (eg. NCBI or UniProt) via websites
  eg. we can use the fruitfly Eyeless protein sequence as a BLAST query sequence to search the UniProt database:

  Fruitfly Eyeless (898 amino acids long)
  MFTLQPTPTAIGTVVPPWSAGTLIERLPSLEDMAHKDNVIAMRNLPCLGTAGGSGLGGIAGKPSPTMEAVEASTASHPHSTSSYFATTYYHLTDDECHSGV
  NQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQE
  NVCTNDNIPSVSSINRVLRNLAAQKEQQSTGSGSSSTSAGNSISAKVSVSIGGNVSNVASGSRGTLSSSTDLMQTATPLNSSESGGASNSGEGSEQEAIYE
  KLRLLNTQHAAGPGPLEPARAAPLVGQSPNHLGTRSSHPQLVHGNHQALQQHQQQSWPPRHYSGSWYPTSLSEIPISSAPNIASVTAYASGPSLAHSLSPP
  NDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGENSNGGASNIGNTEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYPDVFARERLA
  GKIGLPEARIQVWFSNRRAKWRREEKLRNQRRTPNSTGASATSSSTSATASLTDSPNSLSACSSLLSGSAGGPSVSTINGLSSPSTLSTNVNAPTLGAGID
  SSESPTPIPHIRPSCTSDNDNGRQSEDCRRVCSPCPLGVGGHQNTHHIQSNGHAQGHALVPAISPRLNFNSGSFGAMYSNMHHTALSMSDSYGAVTPIPSF
  NHSAVGPLAPPSPIPQQGDLTPSSLYPCHMTLRPPPMAPAHHHIVPGDGGRPAGVGLGSGQSANLGASCSGSGYEVLSAYALPPPPMASSSAADSSFSAAS
  SASANVTPHHTIAQESCPSPCSSASHFGVAHSSGFSSDPISPAVSSYAHMSYNYASSANTMTPSSASGTSAHVAPGKQQFFASCFYSPWV

  We go to www.uniprot.org and click on ‘Blast’ at the top:
• You will get a list of BLAST hits (database sequences with good alignments to your query, ie. to fruitfly Eyeless here):
• Each BLAST hit may have several local alignments to the query sequence eg. the fruitfly Eyeless has human Eyeless as a BLAST hit, and several local alignments are reported for this pair:
• BLAST assesses the statistical significance of high-scoring database matches
• For each alignment between the query and a database protein, it calculates an E-value
• E-value: the number of database matches with a given alignment score (or better) expected by chance, in a database of the size searched
• The lower the E-value, the more significant the alignment score for the sequence match
  E = 1 means that we expect 1 match of that alignment score just by chance, in a database of the size searched
  E = 10^-5 means that we expect to see 10^-5 matches of that alignment score just by chance, in a database of that size
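The dependence of the E-value on the alignment score and on database size can be sketched with the standard relation from BLAST statistics, E = m × n × 2^(-S'), where S' is the normalised (“bit”) score, m is the query length and n is the total length of the database. This formula is not given on the slides; it is brought in here as an assumption to illustrate the definition. The database length below is a made-up number, and real BLAST uses corrected “effective” lengths, so the printed values are illustrative only.

```python
import math


def expected_chance_hits(bit_score, query_len, db_len):
    """E-value for a bit score, assuming E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


query_len = 898            # length of fruitfly Eyeless, from the slides
db_len = 10_000_000_000    # hypothetical total database length

# Each extra bit of score halves the number of matches expected by chance
for bit_score in (30, 40, 50, 60):
    e = expected_chance_hits(bit_score, query_len, db_len)
    print(f"bit score {bit_score}: E = {e:.3g}")

# Doubling the database size doubles the E-value for the same score
e_small = expected_chance_hits(50, query_len, db_len)
e_large = expected_chance_hits(50, query_len, 2 * db_len)
print(math.isclose(e_large, 2 * e_small))
```

This is why the same alignment score can be significant in a small database but not in a large one.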
• Significant BLAST hits are possibly homologues
• We use the E-value to judge if the database sequence is a homologue of the query
  If E ≤ 10^-5, we are confident that the hit is a homologue
  If E is between 10^-5 and 10, we are not sure if the hit is a homologue
  If E > 10, we are doubtful that the hit is a homologue
  eg. searching UniProt using fruitfly Eyeless as our query:
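The rule of thumb on this slide can be written as a small Python function. The thresholds (10^-5 and 10) are the ones given on the slide; the function name and the wording of the returned labels are chosen here for illustration.

```python
def judge_hit(e_value):
    """Classify a BLAST hit by its E-value, using the slide's thresholds."""
    if e_value <= 1e-5:
        return "confident that the hit is a homologue"
    if e_value <= 10:
        return "not sure if the hit is a homologue"
    return "doubtful that the hit is a homologue"


# A few example E-values (chosen for illustration)
for e in (1e-100, 1e-5, 0.5, 10, 189):
    print(e, "->", judge_hit(e))
```

These cut-offs are guidelines for judging hits, not hard boundaries.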
eg. searching the NCBI Protein Database using fruitfly Eyeless as our query:
BLAST matches with high E-values may not be homologues (although it is often hard to tell if they are or not!)
Problem
• Here’s the output of a BLAST search using the predicted protein for a gene prediction from Staphylococcus aureus:
  (i) What does an E-value of 189 mean?
  (ii) Based on the BLAST output, do you think the gene prediction is likely to correspond to a real gene? If so, can you suggest the biological function of that gene?
Answer
• Here’s the output of a BLAST search using the predicted protein for a gene prediction from Staphylococcus aureus:
  (i) What does an E-value of 189 mean?
  An E-value of 189 means that we expect to see 189 BLAST hits with an alignment score at least as high as that of the top BLAST hit (ie. 28.9) by chance, when we search a database of the size searched
  (ii) Based on the BLAST output, do you think the gene prediction is likely to correspond to a real gene? If so, can you suggest the biological function of that gene?
  An E-value of 189 is high, so we can’t be confident the top BLAST hit is a homologue of our query. We shouldn’t predict the function of our query sequence based on such a weak BLAST hit
Further Reading
• Chapter 3 in Introduction to Computational Genomics, Cristianini & Hahn
• Chapter 6 in Deonier et al, Computational Genome Analysis