BLAST

                    Dr Avril Coghlan
                   alc@sanger.ac.uk

Note: this talk contains animations which can only be seen by
downloading it and using ‘View Slide show’ in PowerPoint
• Sequence alignment has many uses
  Sequence assembly – genome sequences are assembled by using
  sequence alignment methods to find overlaps between many short
  pieces of DNA
  Gene finding – alignment of whole genome sequences from two or more
  species can aid in discovery of previously unknown genes
  Sequence divergence – the amount of sequence similarity between
  sequences (which can be calculated from a sequence alignment) tells us
  how closely they are related
  Database searching – we use fast sequence alignment methods (eg.
  BLAST) to determine whether a protein/DNA sequence is similar to any
  known sequence
  Prediction of function – if we know the function of a sequence, we can
  predict the function of similar sequences identified by database searching
  (eg. for fruitfly eyeless gene)
BLAST
  • The number of DNA and protein sequences in public
    databases is very large
      NCBI Protein database has ~38,500,000 protein sequences
  •   Searching a database involves aligning the query sequence to each
      sequence in the database, to find significant local alignments


  Figure: the query sequence A (eg. the predicted protein from a candidate
  gene (ORF), here VIVALASVEGAS) is aligned to each database sequence B
  (TARQDEFGGA, VIVADAVIS, IRYDDEQAKM, KQIRALQPSTQRE, GHQIALMPLKMVQRR,
  ASTILHGGQWLC, etc.)
BLAST
• Needleman-Wunsch & Smith-Waterman are too slow
  for searching databases
• Fast ‘heuristic’ methods are used eg. BLAST
  N.B. ‘heuristic’ means they’re not guaranteed to find the best solution
        (best alignment here), but they usually work well in practice
• BLAST was developed by Stephen Altschul &
  colleagues at NCBI in 1990
  NCBI = National Center for Biotechnology Information (USA)
  BLAST = ‘Basic Local Alignment Search Tool’
• The most used bioinformatics program
  Altschul’s 1997 paper on BLAST has been cited >26,000 times!
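
To see why exhaustive alignment is slow for database searching, here is a minimal sketch of a Smith-Waterman style local alignment score applied to every database sequence in turn. The scoring values (match = 2, mismatch = -1, gap = -2) and the function name are assumptions made for illustration, not part of the talk; the sequences are the short ones from the query/database figure. Each comparison fills a table of size len(query) x len(database sequence), which is what becomes too expensive for millions of sequences.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (simple scoring, linear gap cost)."""
    prev = [0] * (len(b) + 1)          # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


query = "VIVALASVEGAS"
database = ["TARQDEFGGA", "VIVADAVIS", "IRYDDEQAKM",
            "KQIRALQPSTQRE", "GHQIALMPLKMVQRR", "ASTILHGGQWLC"]

# Exhaustive search: one full dynamic-programming table per database sequence
scores = {seq: smith_waterman_score(query, seq) for seq in database}
best_hit = max(scores, key=scores.get)
print(best_hit, scores[best_hit])
```

The work grows with (query length) x (total database length), which is the cost that heuristic methods such as BLAST are designed to avoid.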
There are two main steps in BLAST
1 It makes a list of words of length k (eg. k = 3 amino
  acids) in the query sequence
  It then looks for database sequences that share these words
  Database sequences that share many words with the query are used for
  the final alignments (step 2 )


       Query sequence         ADSKLWLLFKSLMNDKPFKKADFF
       3-letter words         ADS
                               DSK
                                SKL
                                 ...
       Database sequence 1    HIRTHIQLEQEWDSALIAAIQLE       (doesn’t share words)
       Database sequence 2    PDADSTESKLAKAIQLFVCTTILCYT    (shares words ADS and SKL)
       etc.
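
Step 1 can be sketched in a few lines of Python. This is a simplified illustration using exact word matches only, as on the slide; the function names are hypothetical, and real BLAST protein searches also accept similar (not only identical) words.

```python
def words(seq, k=3):
    """All overlapping words of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def shared_words(query, subject, k=3):
    """Words of length k that occur in both sequences."""
    return set(words(query, k)) & set(words(subject, k))


query = "ADSKLWLLFKSLMNDKPFKKADFF"
db1 = "HIRTHIQLEQEWDSALIAAIQLE"
db2 = "PDADSTESKLAKAIQLFVCTTILCYT"

print(words(query)[:3])                  # first three words of the query
print(sorted(shared_words(query, db1)))  # database sequence 1
print(sorted(shared_words(query, db2)))  # database sequence 2
```

Only database sequence 2 shares words with the query, so only it would be passed on to step 2.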
2 For a database sequence that shares many words
  with the query, it makes an alignment
  A local alignment of the query & the database sequence
  The alignment contains the initial region with shared words
  However, the alignment may extend beyond that initial region
• BLAST finds islands of similarity between sequences
  Given two sequences A and B, BLAST makes local alignments of pairs of
       subsequences of A and B

  Figure: three local alignments (alignment 1, alignment 2, alignment 3)
  between sequences A and B
• BLAST reports local alignments between the query
  sequence A and a database sequence B
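
The idea that an alignment starts at a shared word and may extend beyond it can be sketched as follows. This is a deliberately simplified illustration, not BLAST's actual procedure: it extends without gaps, scores +1 for identical letters and -1 otherwise (real BLAST uses a substitution matrix and can add gaps), and stops extending once the score falls more than x_drop below the best seen so far. The function name, the x_drop value and the two toy sequences are all assumptions made for this example.

```python
def extend_seed(query, subject, q_start, s_start, k,
                match=1, mismatch=-1, x_drop=2):
    """Extend an exact word match of length k in both directions, without gaps.

    Returns (score, q_begin, q_end, s_begin, s_end), with end-exclusive ends.
    """
    def extend(q_pos, s_pos, step):
        score = best = 0
        length = best_len = 0
        while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
            score += match if query[q_pos] == subject[s_pos] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length   # remember the best extension
            elif best - score > x_drop:
                break                            # score dropped too far: stop
            q_pos += step
            s_pos += step
        return best, best_len

    right_score, right_len = extend(q_start + k, s_start + k, +1)
    left_score, left_len = extend(q_start - 1, s_start - 1, -1)
    score = k * match + left_score + right_score
    return (score,
            q_start - left_len, q_start + k + right_len,
            s_start - left_len, s_start + k + right_len)


toy_query = "MKVLAAGIW"      # made-up sequences for illustration
toy_subject = "TTKVLAAGLL"
# The shared word "VLA" starts at position 2 in the query, 3 in the subject
hit = extend_seed(toy_query, toy_subject, q_start=2, s_start=3, k=3)
score, qb, qe, sb, se = hit
print(score, toy_query[qb:qe], toy_subject[sb:se])
```

Here the 3-letter seed grows into a 6-letter local alignment, which is the kind of ‘island of similarity’ described above.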
• You can use BLAST to search many sequence
  databases (eg. NCBI or UniProt) via websites
• BLAST compares a DNA/protein query sequence to a
  sequence database and calculates the statistical
  significance (E-value) of matches
• Website for searching GenBank and other NCBI
  sequence databases:
  http://www.ncbi.nlm.nih.gov/BLAST
  Can be used to search the NCBI Nucleotide database (DNA
  sequences), as well as the NCBI Protein database
• Four commonly used types of BLAST search:
  BLASTP: searches a protein database with a protein query
  BLASTN: searches DNA/RNA database with DNA/RNA query
  BLASTX: searches a protein database with DNA/RNA query
  TBLASTN: searches DNA/RNA database with protein query
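
The four search types above amount to a lookup on (query type, database type). A small sketch, where the dictionary and function names are hypothetical and ‘nucleotide’ stands for DNA/RNA:

```python
# (query type, database type) -> BLAST program, as listed above
BLAST_PROGRAMS = {
    ("protein", "protein"): "BLASTP",
    ("nucleotide", "nucleotide"): "BLASTN",
    ("nucleotide", "protein"): "BLASTX",
    ("protein", "nucleotide"): "TBLASTN",
}


def choose_program(query_type, database_type):
    """Return the BLAST program for a given query type and database type."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError("types must be 'protein' or 'nucleotide'") from None


# eg. a protein query (fruitfly Eyeless) against a protein database
print(choose_program("protein", "protein"))
```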
FASTA format
• Many programs for sequence analysis/alignment (eg.
  CLUSTAL) expect the input sequences to be in FASTA
  format
  Each sequence is preceded by a header line that starts with “>”
  followed by the sequence identifier
  >fruitfly
  MFTLQPTPTAIGTVVPPWSAGTLIERLPSLEDMAHKDNVIAMRNLPCLGTAGGSGLGGIAGKPSPTMEAVEASTASHPHSTSSYFATTYYHLTDDECHSGVNQLGGVFVGGR
  PLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQENVCTNDNIPSVSSINRVLRNLA
  AQKEQQSTGSGSSSTSAGNSISAKVSVSIGGNVSNVASGSRGTLSSSTDLMQTATPLNSSESGGASNSGEGSEQEAIYEKLRLLNTQHAAGPGPLEPARAAPLVGQSPNHLG
  TRSSHPQLVHGNHQALQQHQQQSWPPRHYSGSWYPTSLSEIPISSAPNIASVTAYASGPSLAHSLSPPNDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGENS
  NGGASNIGNTEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYPDVFARERLAGKIGLPEARIQVWFSNRRAKWRREEKLRNQRRTPNSTGASATSSSTSATASLTDS
  PNSLSACSSLLSGSAGGPSVSTINGLSSPSTLSTNVNAPTLGAGIDSSESPTPIPHIRPSCTSDNDNGRQSEDCRRVCSPCPLGVGGHQNTHHIQSNGHAQGHALVPAISPR
  LNFNSGSFGAMYSNMHHTALSMSDSYGAVTPIPSFNHSAVGPLAPPSPIPQQGDLTPSSLYPCHMTLRPPPMAPAHHHIVPGDGGRPAGVGLGSGQSANLGASCSGSGYEVL
  SAYALPPPPMASSSAADSSFSAASSASANVTPHHTIAQESCPSPCSSASHFGVAHSSGFSSDPISPAVSSYAHMSYNYASSANTMTPSSASGTSAHVAPGKQQFFASCFYSP
  WV
  >human
  MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVC
  TNDNIPSVSSINRVLRNLASEKQQMGADGMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPTQDGCQQQEGGGENTNSISSNGEDSDEAQMRLQLKRKLQRNRTSFTQEQIEA
  LEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASNTPSHIPISSSFSTSVYQPIPQPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFT
  MANNLPMQPPVPSQTSSYSCMLPTSPSVNGRSYDTYTPPHMQTHMNSQPMGTSGTTSTGLISPGVSVPVQVPGSEPDMSQYWPRLQ
  >mouse
  MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVC
  TNDNIPSVSSINRVLRNLASEKQQMGADGMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPTQDGCQQQEGGGENTNSISSNGEDSDEAQMRLQLKRKLQRNRTSFTQEQIEA
  LEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASNTPSHIPISSSFSTSVYQPIPQPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFT
  MANNLPMQPPVPSQTSSYSCMLPTSPSVNGRSYDTYTPPHMQTHMNSQPMGTSGTTSTGLISPGVSVPVQVPGSEPDMSQYWPRLQ
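
Reading FASTA format takes only a few lines of code. The sketch below is a minimal parser written for illustration (the function name is hypothetical); the example input uses just the first few residues of the fruitfly and human sequences above, truncated to keep it short.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict {identifier: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""   # identifier follows ">"
            records[name] = []
        elif name is not None:
            records[name].append(line)    # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


example = """\
>fruitfly
MFTLQPTPTAIG
TVVPPWSAGTLI
>human
MQNSHSGVNQLG
"""

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq), seq)
```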
• You can use BLAST to search many sequence
  databases (eg. NCBI or UniProt) via websites
  eg., we can use the fruitfly Eyeless protein sequence as a BLAST query
  sequence to search the UniProt database:

  Fruitfly Eyeless (898 amino acids long):
  MFTLQPTPTAIGTVVPPWSAGTLIERLPSLEDMAHKDNVIAMRNLPCLGTAGGSGLGGIAGKPSPTMEAVEASTASHPHSTSSYFATTYYHLTDDECHSGV
  NQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQE
  NVCTNDNIPSVSSINRVLRNLAAQKEQQSTGSGSSSTSAGNSISAKVSVSIGGNVSNVASGSRGTLSSSTDLMQTATPLNSSESGGASNSGEGSEQEAIYE
  KLRLLNTQHAAGPGPLEPARAAPLVGQSPNHLGTRSSHPQLVHGNHQALQQHQQQSWPPRHYSGSWYPTSLSEIPISSAPNIASVTAYASGPSLAHSLSPP
  NDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGENSNGGASNIGNTEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYPDVFARERLA
  GKIGLPEARIQVWFSNRRAKWRREEKLRNQRRTPNSTGASATSSSTSATASLTDSPNSLSACSSLLSGSAGGPSVSTINGLSSPSTLSTNVNAPTLGAGID
  SSESPTPIPHIRPSCTSDNDNGRQSEDCRRVCSPCPLGVGGHQNTHHIQSNGHAQGHALVPAISPRLNFNSGSFGAMYSNMHHTALSMSDSYGAVTPIPSF
  NHSAVGPLAPPSPIPQQGDLTPSSLYPCHMTLRPPPMAPAHHHIVPGDGGRPAGVGLGSGQSANLGASCSGSGYEVLSAYALPPPPMASSSAADSSFSAAS
  SASANVTPHHTIAQESCPSPCSSASHFGVAHSSGFSSDPISPAVSSYAHMSYNYASSANTMTPSSASGTSAHVAPGKQQFFASCFYSPWV



  We go to www.uniprot.org and click on ‘Blast’ at the top:
• You will get a list of BLAST hits (database sequences
  with good alignments to your query, ie. to fruitfly
  Eyeless here):
• Each BLAST hit may have several local alignments to
  the query sequence
  eg. the fruitfly Eyeless has human Eyeless as a BLAST hit, and
  several local alignments are reported for this pair:
• BLAST assesses the statistical significance of
  high-scoring database matches
• For each alignment between the query and a
  database protein, it calculates an E-value
• E-value: the number of database matches of a
  certain alignment score expected by chance, in a
  database of the size searched
• The lower the E-value, the more significant the
  alignment score for the sequence match
  E=1 means that we expect 1 match of that alignment score just by
  chance, in a database of the size searched
  E=10^-5 means that we expect to see 10^-5 matches of that alignment score
  just by chance, in a database of that size
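
How the E-value depends on the alignment score and the database size can be illustrated with the Karlin-Altschul estimate E = K * m * n * exp(-lambda * S), where S is the alignment score, m the query length and n the database length. This formula is not given in the talk and strictly applies to ungapped local alignments; the values of lambda and K used below are hypothetical, chosen only for illustration. Under the usual Poisson assumption, the probability of at least one chance match is P = 1 - exp(-E).

```python
import math


def expected_chance_hits(score, query_len, db_len, lam, k):
    """Karlin-Altschul estimate of the E-value: K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)


def p_value(evalue):
    """Probability of at least one chance match, assuming a Poisson model."""
    return -math.expm1(-evalue)          # equals 1 - exp(-E), computed stably


# Hypothetical parameters, for illustration only
LAM, K = 0.3, 0.1
e_small_db = expected_chance_hits(50, 300, 1_000_000, lam=LAM, k=K)
e_large_db = expected_chance_hits(50, 300, 2_000_000, lam=LAM, k=K)
e_higher_score = expected_chance_hits(60, 300, 1_000_000, lam=LAM, k=K)

print(e_small_db, e_large_db, e_higher_score)
print(p_value(1.0), p_value(1e-5))
```

Doubling the database size doubles the E-value, while a higher alignment score lowers it exponentially. For small E, the E-value and the P-value are almost equal.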
• Significant BLAST hits are possibly homologues
• We use the E-value to judge if the database
  sequence is a homologue of the query
  If E ≤ 10^-5, we are confident that the hit is a homologue
  If E is between 10^-5 and 10, we are not sure if the hit is a homologue
  If E is > 10, we are doubtful that the hit is a homologue
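
These rules of thumb can be written as a small function. The thresholds are the ones given above; the function name and the wording of the labels are made up for this sketch.

```python
def classify_hit(evalue):
    """Apply the rule-of-thumb E-value thresholds to a BLAST hit."""
    if evalue <= 1e-5:
        return "confident homologue"
    if evalue <= 10:
        return "not sure"
    return "doubtful"


for e in (1e-30, 1e-5, 0.5, 10, 189):
    print(e, classify_hit(e))
```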
  eg. searching UniProt using fruitfly Eyeless as our query:
  eg. searching the NCBI Protein database using fruitfly Eyeless as our
  query:

  BLAST matches with high E-values may not be homologues (although it
  is often hard to tell if they are or not!)
Problem
• Here’s the output of a BLAST search using the
  predicted protein for a gene prediction from
  Staphylococcus aureus:




  (i) What does an E-value of 189 mean?
  (ii) Based on the BLAST output, do you think the gene prediction is
  likely to correspond to a real gene? If so, can you suggest the
  biological function of that gene?
Answer
•   Here’s the output of a BLAST search using the predicted protein for a
    gene prediction from Staphylococcus aureus:




    (i) What does an E-value of 189 mean?
    An E-value of 189 means that we expect to see 189 BLAST hits with an
    alignment score at least as high as that of the top BLAST hit (ie. 28.9)
    just by chance, in a database of the size searched
    (ii) Based on the BLAST output, do you think the gene prediction is likely
    to correspond to a real gene? If so, can you suggest the biological function
    of that gene?
    An E-value of 189 is high, so we can’t be confident that the top BLAST hit
    is a homologue of our query. We shouldn’t predict the function of our
    query sequence based on such a weak BLAST hit
Further Reading
•   Chapter 3 in Introduction to Computational Genomics, Cristianini & Hahn
•   Chapter 6 in Computational Genome Analysis, Deonier et al

BLAST

  • 1. BLAST Dr Avril Coghlan alc@sanger.ac.uk Note: this talk contains animations which can only be seen by downloading and using ‘View Slide show’ in Powerpoint
  • 2. • Sequence alignment has many uses Sequence assembly – genome sequences are assembled by using sequence alignment methods to find overlaps between many short pieces of DNA Gene finding – alignment of whole genome sequences from two or more species can aid in discovery of previously unknown genes Sequence divergence – the amount of sequence similarity between sequences (which can be calculated from a sequence alignment) tells us how closely they are related Database searching – we use fast sequence alignment methods (eg. BLAST) to determine whether a protein/DNA sequence is similar to any known sequence Prediction of function – if we know the function of a sequence, we can predict the function of similar sequences identified by database searching (eg. for fruitfly eyeless gene)
  • 3. BLAST • The number of DNA and protein sequences in public databases is very large NCBI Protein database has ~38,500,000 protein sequences • Searching a database involves aligning the query sequence to each sequence in the database, to find significant local alignments eg. predicted protein from a Database sequences B candidate gene TARQDEFGGA (ORF) Align A to VIVADAVIS Database IRYDDEQAKM Query sequence A each B KQIRALQPSTQRE GHQIALMPLKMVQRR VIVALASVEGAS ASTILHGGQWLC etc. etc.
  • 4. BLAST • Needleman-Wunsch & Smith-Waterman are too slow for searching databases • Fast ‘heuristic’ methods are used eg. BLAST N.B. ‘heuristic’ means they’re not guaranteed to find the best solution (best alignment here), but they work okay • BLAST was developed by Stephen Altschul & colleagues at NCBI in 1990 NCBI = National Center for Biotechnology Information (USA) BLAST = ‘Basic Local Alignment Search Tool’ • The most used bioinformatics program Altschul’s 1997 paper on BLAST has been cited >26,000 times!
  • 5. There are two main steps in BLAST 1 It makes a list of words of length k (eg. k = 3 amino acids) in the query sequence It then looks for database sequences that share these words Database sequences that share many words with the query are used for the final alignments (step 2 ) Query sequence ADSKLWLLFKSLMNDKPFKKADFF 3-bp words ADS DSK SKL ... Database sequence 1 HIRTHIQLEQEWDSALIAAIQLE Doesn’t share words Database sequence 2 etc. PDADSTESKLAKAIQLFVCTTILCYT Shares ADS SKL words
  • 6. 2 For a database sequence that shares many words with the query, it makes an alignment A local alignment of the query & the database sequence The alignment contains the initial region with shared words However, the alignment may extend beyond that initial region • BLAST finds islands of similarity between sequences Given two sequences A and B, BLAST makes local alignments of pairs of subsequences of A and B A alignment 1 alignment 2 alignment 3 B • BLAST reports local alignments between the query sequence A and a database sequence B
  • 7. • You can use BLAST to search many sequence databases (eg. NCBI or UniProt) via websites • Compares a DNA/protein query sequence to a sequence database and calculates the statistical significance (P-value) of matches • Website for searching GenBank and other NCBI sequence databases: http://www.ncbi.nlm.nih.gov/BLAST Can be used to search the NCBI Nucleotide database (DNA sequences), as well as the NCBI Protein database • There are 4 different types of BLAST search: BLASTP: searches a protein database with a protein query BLASTN: searches DNA/RNA database with DNA/RNA query BLASTX: searches a protein database with DNA/RNA query TBLASTN: searches DNA/RNA database with protein query
  • 8. FASTA format
    • Many programs for sequence analysis/alignment (eg. CLUSTAL) expect the input sequences to be in FASTA format
      Each sequence is preceded by a header line that starts with “>” followed by the sequence identifier
>fruitfly
MFTLQPTPTAIGTVVPPWSAGTLIERLPSLEDMAHKDNVIAMRNLPCLGTAGGSGLGGIAGKPSPTMEAVEASTASHPHSTSSYFATTYYHLTDDECHSGVNQLGGVFVGGR
PLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQENVCTNDNIPSVSSINRVLRNLA
AQKEQQSTGSGSSSTSAGNSISAKVSVSIGGNVSNVASGSRGTLSSSTDLMQTATPLNSSESGGASNSGEGSEQEAIYEKLRLLNTQHAAGPGPLEPARAAPLVGQSPNHLG
TRSSHPQLVHGNHQALQQHQQQSWPPRHYSGSWYPTSLSEIPISSAPNIASVTAYASGPSLAHSLSPPNDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGENS
NGGASNIGNTEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYPDVFARERLAGKIGLPEARIQVWFSNRRAKWRREEKLRNQRRTPNSTGASATSSSTSATASLTDS
PNSLSACSSLLSGSAGGPSVSTINGLSSPSTLSTNVNAPTLGAGIDSSESPTPIPHIRPSCTSDNDNGRQSEDCRRVCSPCPLGVGGHQNTHHIQSNGHAQGHALVPAISPR
LNFNSGSFGAMYSNMHHTALSMSDSYGAVTPIPSFNHSAVGPLAPPSPIPQQGDLTPSSLYPCHMTLRPPPMAPAHHHIVPGDGGRPAGVGLGSGQSANLGASCSGSGYEVL
SAYALPPPPMASSSAADSSFSAASSASANVTPHHTIAQESCPSPCSSASHFGVAHSSGFSSDPISPAVSSYAHMSYNYASSANTMTPSSASGTSAHVAPGKQQFFASCFYSP
WV
>human
MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVC
TNDNIPSVSSINRVLRNLASEKQQMGADGMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPTQDGCQQQEGGGENTNSISSNGEDSDEAQMRLQLKRKLQRNRTSFTQEQIEA
LEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASNTPSHIPISSSFSTSVYQPIPQPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFT
MANNLPMQPPVPSQTSSYSCMLPTSPSVNGRSYDTYTPPHMQTHMNSQPMGTSGTTSTGLISPGVSVPVQVPGSEPDMSQYWPRLQ
>mouse
MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVC
TNDNIPSVSSINRVLRNLASEKQQMGADGMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPTQDGCQQQEGGGENTNSISSNGEDSDEAQMRLQLKRKLQRNRTSFTQEQIEA
LEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASNTPSHIPISSSFSTSVYQPIPQPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFT
MANNLPMQPPVPSQTSSYSCMLPTSPSVNGRSYDTYTPPHMQTHMNSQPMGTSGTTSTGLISPGVSVPVQVPGSEPDMSQYWPRLQ
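To make the FASTA layout concrete, here is a minimal sketch of a reader that follows the rule on this slide: a header line starting with “>” gives the identifier, and the following lines (up to the next header) are joined to form the sequence. The function name `parse_fasta` is hypothetical, and the short toy sequences in the usage example are made up for illustration; a real pipeline would more likely use an established library.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA-formatted text into a dict of {identifier: sequence}.

    Sketch only: the identifier is taken to be the first word after ">",
    and sequence lines are concatenated with whitespace stripped.
    """
    chunks = {}
    current_id = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("header line has no sequence identifier")
            current_id = parts[0]
            chunks[current_id] = []
        elif current_id is None:
            raise ValueError("sequence data found before the first header line")
        else:
            chunks[current_id].append(line)
    return {seq_id: "".join(lines) for seq_id, lines in chunks.items()}


# Toy input (made-up short sequences, not the real Eyeless proteins):
example = ">fruitfly\nMFTLQP\nTPTAIG\n>human\nMQNSHS\n"
records = parse_fasta(example)
for seq_id, seq in records.items():
    print(seq_id, len(seq))
```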
  • 9. • You can use BLAST to search many sequence databases (eg. NCBI or UniProt) via websites
      eg. we can use the fruitfly Eyeless protein sequence (898 amino acids long) as a BLAST query sequence to search the UniProt database:
MFTLQPTPTAIGTVVPPWSAGTLIERLPSLEDMAHKDNVIAMRNLPCLGTAGGSGLGGIAGKPSPTMEAVEASTASHPHSTSSYFATTYYHLTDDECHSGV
NQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQE
NVCTNDNIPSVSSINRVLRNLAAQKEQQSTGSGSSSTSAGNSISAKVSVSIGGNVSNVASGSRGTLSSSTDLMQTATPLNSSESGGASNSGEGSEQEAIYE
KLRLLNTQHAAGPGPLEPARAAPLVGQSPNHLGTRSSHPQLVHGNHQALQQHQQQSWPPRHYSGSWYPTSLSEIPISSAPNIASVTAYASGPSLAHSLSPP
NDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGENSNGGASNIGNTEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYPDVFARERLA
GKIGLPEARIQVWFSNRRAKWRREEKLRNQRRTPNSTGASATSSSTSATASLTDSPNSLSACSSLLSGSAGGPSVSTINGLSSPSTLSTNVNAPTLGAGID
SSESPTPIPHIRPSCTSDNDNGRQSEDCRRVCSPCPLGVGGHQNTHHIQSNGHAQGHALVPAISPRLNFNSGSFGAMYSNMHHTALSMSDSYGAVTPIPSF
NHSAVGPLAPPSPIPQQGDLTPSSLYPCHMTLRPPPMAPAHHHIVPGDGGRPAGVGLGSGQSANLGASCSGSGYEVLSAYALPPPPMASSSAADSSFSAAS
SASANVTPHHTIAQESCPSPCSSASHFGVAHSSGFSSDPISPAVSSYAHMSYNYASSANTMTPSSASGTSAHVAPGKQQFFASCFYSPWV
      We go to www.uniprot.org and click on ‘Blast’ at the top:
  • 10. • You will get a list of BLAST hits (database sequences with good alignments to your query, ie. to fruitfly Eyeless here):
  • 11. • Each BLAST hit may have several local alignments to the query sequence eg. the fruitfly Eyeless has human Eyeless as a BLAST hit, and several local alignments are reported for this pair:
  • 12. • BLAST assesses the statistical significance of high-scoring database matches
    • For each alignment between the query and a database protein, it calculates an E-value
    • E-value: the number of database matches with at least a given alignment score expected by chance, in a database of the size searched
    • The lower the E-value, the more significant the alignment score for the sequence match
      E = 1 means that we expect 1 match with that alignment score just by chance, in a database of the size searched
      E = 10⁻⁵ means that we expect to see 10⁻⁵ matches with that alignment score just by chance, in a database of that size
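An E-value can also be expressed as a P-value (the probability of seeing at least one chance match with that score). The sketch below assumes the number of chance matches follows a Poisson distribution with mean E, which gives P = 1 − e^(−E); that assumption and the function name `e_value_to_p_value` are not from the slides and are introduced here only for illustration.

```python
import math


def e_value_to_p_value(e_value: float) -> float:
    """Convert an E-value to a P-value, assuming chance matches are
    Poisson-distributed with mean equal to the E-value (an assumption
    of this sketch): P = 1 - exp(-E).
    """
    if e_value < 0:
        raise ValueError("an E-value cannot be negative")
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E
    return -math.expm1(-e_value)


for e in (1e-5, 1.0, 10.0):
    print(e, e_value_to_p_value(e))
```

For small E-values the two numbers are almost identical, which is why a hit with E = 10⁻⁵ can be read as having roughly a 10⁻⁵ chance of arising at random; for large E-values the P-value approaches 1.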
  • 13. • Significant BLAST hits are possibly homologues
    • We use the E-value to judge if the database sequence is a homologue of the query
      If E ≤ 10⁻⁵, we are confident that the hit is a homologue
      If E is between 10⁻⁵ and 10, we are not sure if the hit is a homologue
      If E > 10, we are doubtful that the hit is a homologue
      eg. searching UniProt using fruitfly Eyeless as our query:
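The E-value rule of thumb on this slide can be sketched as a small function. The thresholds (10⁻⁵ and 10) are the ones given on the slide; the function name `classify_hit` and the returned labels are hypothetical, chosen for this example.

```python
def classify_hit(e_value: float) -> str:
    """Apply the slide's rule of thumb for judging a BLAST hit by its E-value."""
    if e_value < 0:
        raise ValueError("an E-value cannot be negative")
    if e_value <= 1e-5:
        return "confident homologue"   # E <= 10^-5
    if e_value <= 10:
        return "uncertain"             # E between 10^-5 and 10
    return "doubtful"                  # E > 10


for e in (1e-50, 0.5, 189):
    print(e, classify_hit(e))
```

These cut-offs are a convention for interpreting results, not a hard statistical boundary, so hits near a threshold still deserve a look at the alignment itself.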
  • 14. eg. searching the NCBI Protein Database using fruitfly Eyeless as our query:
      BLAST matches with high E-values may not be homologues (although it is often hard to tell if they are or not!)
  • 15. Problem • Here’s the output of a BLAST search using the predicted protein for a gene prediction from Staphylococcus aureus:
      (i) What does an E-value of 189 mean?
      (ii) Based on the BLAST output, do you think the gene prediction is likely to correspond to a real gene? If so, can you suggest the biological function of that gene?
  • 16. Answer • Here’s the output of a BLAST search using the predicted protein for a gene prediction from Staphylococcus aureus:
      (i) What does an E-value of 189 mean?
      An E-value of 189 means that we expect to see 189 BLAST hits with an alignment score at least as high as that of the top BLAST hit (ie. 28.9) by chance, when we search a database of the size searched
      (ii) Based on the BLAST output, do you think the gene prediction is likely to correspond to a real gene? If so, can you suggest the biological function of that gene?
      An E-value of 189 is high, so we can’t be confident the top BLAST hit is a homologue of our query. We shouldn’t predict the function of our query sequence based on such a weak BLAST hit
  • 17. Further Reading
    • Chapter 3 in Introduction to Computational Genomics, Cristianini & Hahn
    • Chapter 6 in Computational Genome Analysis, Deonier et al

Editor's Notes

  1. The figure of ~38,500,000 protein sequences is from searching NCBI Protein for 1:10000000000000000000000[SLEN] on 18-Feb-2011, which returned 38535878 matching protein sequences. Image credit (filing cabinet): http://etc.usf.edu/clipart/13000/13089/file_cabinet_13089_lg.gif
  2. Image credit (Stephen Altschul): http://www.iscb.org/cms_addon/conferences/ismb2002/images/stephen.jpg