BITS: Basics of Sequence similarity

Module 2 Sequence similarity.

Part of bioinformatics training session "Basic Bioinformatics concepts, databases and tools" - http://www.bits.vib.be/training

Published in: Education, Technology
  • BLOcks SUbstitution Matrices: BLOSUM62 is derived from blocks in which sequences that are at least 62% identical are clustered together. Low PAM/high BLOSUM matrices are optimal for short alignments at small evolutionary distance; high PAM/low BLOSUM matrices are optimal for long alignments at large evolutionary distance.
  • Note: it has been shown that E is underestimated by a factor of about 10, because databank sequences are not random.

    1. 1. Basic bioinformatics concepts, databases and tools. Module 2: Searching for similar sequences. Joachim Jacob, http://www.bits.vib.be. Updated February 2012. http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod2-intro_H1_2012_SimSearch.pdf
    2. 2. Based on annotations, we can use text searching to get sequences of interest (module 1). WHERE: primary dbs, derived dbs. HOW to find sequences: by keywords, by literature, by annotation. See BITS website, module 1.
    3. 3. In this module, we will look into sequence similarity to get and analyze sequences.
    4. 4. Detecting sequence similarity is a cornerstone type of analysis in bioinformatics. Why would we like to detect similar sequences? 1. Searching in sequence databases for similar sequences. 2. In a high-throughput experiment, every read needs to be aligned to a genomic reference sequence or to the other reads (assembly). 3. Elucidating function by detecting sites of conservation (sequence parts that resemble each other more than would be expected). 4. Phylogeny is built upon the comparison of multiple sequences.
    5. 5. Comparing sequences can be classified into one to one, one to many, or many to many. How to search for sequence similarity: One to many: in sequence databases, via BLAST or FASTA. One to one: to compare two sequences in detail (pairwise sequence comparison). Many to many: to compare multiple sequences at once (multiple sequence alignment, de novo assembly). Methods can be categorized into: optimal/exhaustive, heuristic, graphical.
    6. 6. Conceptualizing the source of sequence similarity. Sequences can be similar because they are derived from evolutionarily related organisms, or because they evolved under similar conditions (convergent evolution).
    7. 7. The source of sequence similarity. So the first question we have to solve: what is similar? How do we measure similarity?
    8. 8. Let's assume this toy example: a short sequence mutating over time (without insertions or deletions occurring). Similar? [Figure: summary of the changes that really occurred]
    9. 9. So taking the most divergent sequences (the first and the last), the only correct alignment for those two, given their history, is: [Figure: alignment of the first and the last sequence]
    10. 10. But we usually don't have all intermediate sequences, only the first and the last. How do we determine the correct alignment? Similar? KLRMWILVATAEIDD / KPRMCILVAIADIRD. In addition, multiple changes can have happened at one location over time.
    11. 11. Many possibilities exist to align them: slide the sequences over each other. One of those positions will have the highest number of identical residues, called matches (green on the slide). [Figure: KLRMWILVATAEIDD and KPRMCILVAIADIRD shown at four different relative positions]
    12. 12. In this example, we claim a match whenever we see an identical residue at the same position in both sequences. [Figure: KLRMWILVATAEIDD and KPRMCILVAIADIRD shown at four different relative positions]
    13. 13. The identity matrix summarizes this scoring system, listing all residue combinations in a table. [Table: residues A, C, Y, W on both axes; identical pairs are a match, all other pairs a mismatch]
    14. 14. Substitution or score matrices provide a means to determine similarity in an objective way. Such matrices are called substitution or scoring matrices. They are used to calculate a score for every pair of aligned residues in aligned sequences, in order to have a measure for sequence similarity.

        KLRMWILVATAEIDD
           KPRMCILVAIADIRD
        Score: 0 0 0 0 0 0 0 0 0 1 0 1   Sum of the scores: 2

             KLRMWILVATAEIDD
        KPRMCILVAIADIRD
        Score: 0 1 0 0 0 0 0 0 0 0   Sum of the scores: 1

        KLRMWILVATAEIDD
        KPRMCILVAIADIRD
        Score: 1 0 1 1 0 1 1 1 1 0 1 0 1 0 1   Sum of the scores: 10
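
The identity scoring of the toy example can be sketched in a few lines of Python. This is only an illustration, not the method of any particular tool: the function name `identity_score` and the `offset` convention are assumptions made here. For the two sequences as printed in this transcript, the unshifted alignment has 10 identical positions (positions 2, 5, 10, 12 and 14 differ).

```python
def identity_score(seq_a: str, seq_b: str, offset: int = 0) -> int:
    """Count identical residues when seq_b is shifted `offset` positions
    to the right relative to seq_a (a negative offset shifts it left)."""
    score = 0
    for i, residue_a in enumerate(seq_a):
        j = i - offset  # index of the residue of seq_b facing seq_a[i]
        if 0 <= j < len(seq_b) and residue_a == seq_b[j]:
            score += 1
    return score


first = "KLRMWILVATAEIDD"
last = "KPRMCILVAIADIRD"

# Slide the sequences over each other and keep the position with most matches.
offsets = range(-len(last) + 1, len(first))
scores = {offset: identity_score(first, last, offset) for offset in offsets}
best_offset = max(scores, key=scores.get)
print("best offset:", best_offset, "matches:", scores[best_offset])
```

Trying every offset is the programmatic version of dragging one sequence along the other; it handles no gaps, which is why gap penalties and dynamic programming come in later.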
    15. 15. Complex substitution matrices are more meaningful and more sensitive for detecting similarity. The two most popular are PAM and BLOSUM. Every pair of aligned residues gets a score based on the matrix. E.g. an A-A alignment gets score 2 (PAM120) or 4 (BLOSUM62). An F-G pair gets -5 or -3. Likely changes: positive score; unlikely changes: negative score. http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Scoring2.html http://biology.unm.edu/biology/maggieww/Public_Html/444544seqsim.html
    16. 16. Substitution matrices are derived from analysis of multiple alignments of related sequences. PAM (Point Accepted Mutations) by Margaret Dayhoff: global alignments of proteins with >85% identity --> phylogenetic trees --> count substitutions --> estimate probability of conservation/substitution at a distance of 1 mutation per 100 aa ==> PAM1 table; PAMn tables by matrix multiplication. BLOSUM (BLOCKS Substitution Matrices) by Henikoff and Henikoff: BLOCKS databank (= local multiple sequence alignments without gaps) made from protein families from the PROSITE databank --> BLOSUMn table derived from blocks in which sequences that are at least n% identical are clustered together. http://en.wikipedia.org/wiki/Substitution_matrix
    17. 17. The BLOSUM62 similarity matrix

            A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   X   Y   Z
        A   4  -2   0  -2  -1  -2   0  -2  -1  -1  -1  -1  -2  -1  -1  -1   1   0   0  -3  -1  -2  -1
        B  -2   6  -3   6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -1  -3   2
        C   0  -3   9  -3  -4  -2  -3  -3  -1  -3  -1  -1  -3  -3  -3  -3  -1  -1  -1  -2  -1  -2  -4
        D  -2   6  -3   6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -1  -3   2
        E  -1   2  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -1  -2   5
        F  -2  -3  -2  -3  -3   6  -3  -1   0  -3   0   0  -3  -4  -3  -3  -2  -2  -1   1  -1   3  -3
        G   0  -1  -3  -1  -2  -3   6  -2  -4  -2  -4  -3   0  -2  -2  -2   0  -2  -3  -2  -1  -3  -2
        H  -2  -1  -3  -1   0  -1  -2   8  -3  -1  -3  -2   1  -2   0   0  -1  -2  -3  -2  -1   2   0
        I  -1  -3  -1  -3  -3   0  -4  -3   4  -3   2   1  -3  -3  -3  -3  -2  -1   3  -3  -1  -1  -3
        K  -1  -1  -3  -1   1  -3  -2  -1  -3   5  -2  -1   0  -1   1   2   0  -1  -2  -3  -1  -2   1
        L  -1  -4  -1  -4  -3   0  -4  -3   2  -2   4   2  -3  -3  -2  -2  -2  -1   1  -2  -1  -1  -3
        M  -1  -3  -1  -3  -2   0  -3  -2   1  -1   2   5  -2  -2   0  -1  -1  -1   1  -1  -1  -1  -2
        N  -2   1  -3   1   0  -3   0   1  -3   0  -3  -2   6  -2   0   0   1   0  -3  -4  -1  -2   0
        P  -1  -1  -3  -1  -1  -4  -2  -2  -3  -1  -3  -2  -2   7  -1  -2  -1  -1  -2  -4  -1  -3  -1
        Q  -1   0  -3   0   2  -3  -2   0  -3   1  -2   0   0  -1   5   1   0  -1  -2  -2  -1  -1   2
        R  -1  -2  -3  -2   0  -3  -2   0  -3   2  -2  -1   0  -2   1   5  -1  -1  -3  -3  -1  -2   0
        S   1   0  -1   0   0  -2   0  -1  -2   0  -2  -1   1  -1   0  -1   4   1  -2  -3  -1  -2   0
        T   0  -1  -1  -1  -1  -2  -2  -2  -1  -1  -1  -1   0  -1  -1  -1   1   5   0  -2  -1  -2  -1
        V   0  -3  -1  -3  -2  -1  -3  -3   3  -2   1   1  -3  -2  -2  -3  -2   0   4  -3  -1  -1  -2
        W  -3  -4  -2  -4  -3   1  -2  -2  -3  -3  -2  -1  -4  -4  -2  -3  -3  -2  -3  11  -1   2  -3
        X  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
        Y  -2  -3  -2  -3  -2   3  -3   2  -1  -2  -1  -1  -2  -3  -1  -2  -2  -2  -1   2  -1   7  -2
        Z  -1   2  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -1  -2   5

        ftp://ftp.ncbi.nih.gov/blast/matrices/
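
To see what a substitution matrix changes compared with plain identity counting, the toy alignment can be rescored with BLOSUM62. The sketch below is an illustration only: `BLOSUM62_SUBSET` and `pair_score` are names invented here, and the dictionary holds just the residue pairs needed for the example, with values copied from the table above.

```python
# Only the pairs needed below, copied from the BLOSUM62 table above.
# The matrix is symmetric, so each pair is stored once with its letters sorted.
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("D", "D"): 6, ("I", "I"): 4, ("K", "K"): 5,
    ("L", "L"): 4, ("M", "M"): 5, ("R", "R"): 5, ("V", "V"): 4,
    ("L", "P"): -3, ("C", "W"): -2, ("I", "T"): -1,
    ("D", "E"): 2, ("D", "R"): -2, ("F", "G"): -3,
}


def pair_score(residue_a: str, residue_b: str) -> int:
    """Look up the score of one aligned residue pair."""
    return BLOSUM62_SUBSET[tuple(sorted((residue_a, residue_b)))]


def alignment_score(seq_a: str, seq_b: str) -> int:
    """Sum the pair scores of an ungapped alignment of equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq_a, seq_b))


total = alignment_score("KLRMWILVATAEIDD", "KPRMCILVAIADIRD")
print("BLOSUM62 score of the toy alignment:", total)
```

Note how the E/D pair, a mismatch under the identity matrix, contributes +2 here because the two residues have similar properties, while L/P is penalised with -3.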
    18. 18. The substitution matrices capture the similarity in properties between residues. From Livingstone, C. D. and Barton, G. J. (1993), "Protein Sequence Alignments: A Strategy for the Hierarchical Analysis of Residue Conservation", Comp. Appl. Bio. Sci., 9, 745-756.
    19. 19. A matrix does not capture insertions and deletions: penalties are given to deal with them. When two sequences are aligned, each aligned position can only be one of the following: identity; mismatch (substitution (DNA) or similarity level (protein)); gap (insertion/deletion).
    20. 20. The two parts of the gap penalty: a higher penalty for creation, a lower one for its extension. [Figure labels: score from substitution matrix; gap penalty]
    21. 21. Substitution matrices are used in many algorithms to detect sequence similarity. How to search for sequence similarity: One to many: in sequence databases (BLAST or FASTA). One to one: to compare two sequences in detail (pairwise sequence comparison). Many to many: to compare multiple sequences (multiple sequence alignment). 3 methods exist: graphical, optimal/exhaustive, heuristic.
    22. 22. Pairwise sequence comparison – one to one To create an alignment between two sequences - Manually (?) - Two sequences (= pairwise alignment): optimal alignment through dynamic programming – Needleman-Wunsch (global alignment) – Smith-Waterman (local alignment)
    23. 23. Dynamic programming uses a gap penalty and a scoring scheme to align two sequences. Dynamic programming: two things needed. Scoring scheme to measure identity and similarity: choose a scoring matrix for similarity and identity (e.g. PAM250). Gap penalty: each gap is given a penalty in the final score, also called weight or cost. Most used: a + b * (n-1) for a gap of n positions, where a is the gap opening penalty (higher penalty, negative score) and b is the gap extension penalty (smaller penalty to widen a gap). http://biology.unm.edu/biology/maggieww/Public_Html/444544seqsim.html
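
The gap penalty formula a + b * (n-1) is small enough to write out directly. In this sketch the function name and the values a = 10 and b = 0.5 are illustrative assumptions; they are not taken from the slides.

```python
def affine_gap_penalty(n: int, gap_open: float, gap_extend: float) -> float:
    """Penalty a + b * (n - 1) for one gap spanning n positions (n >= 1)."""
    if n < 1:
        raise ValueError("a gap spans at least one position")
    return gap_open + gap_extend * (n - 1)


# Illustrative values only: a = 10 (opening), b = 0.5 (extension).
for length in (1, 2, 5):
    print(length, affine_gap_penalty(length, gap_open=10, gap_extend=0.5))

# One gap of 5 positions costs less than five separate gaps of 1 position.
one_long_gap = affine_gap_penalty(5, 10, 0.5)
five_short_gaps = 5 * affine_gap_penalty(1, 10, 0.5)
```

Because opening is more expensive than extending, the alignment algorithm prefers a few long gaps over many scattered short ones.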
    24. 24. Dynamic programming and backtracking. Source sequence: A T T; target sequence: T T C.
        Each cell is filled from its three neighbours:
        $S_{i,j} = \max\{\, S_{i-1,j-1} + s(a_i,b_j),\; S_{i,j-1} + s(-,b_j),\; S_{i-1,j} + s(a_i,-) \,\}$
        Scoring scheme: $s(a_i,b_j) = +1$ if $a_i = b_j$; $s(a_i,b_j) = -1$ if $a_i \neq b_j$; $s(a_i,-) = -1$; $s(-,b_j) = -1$.
        Filled matrix (source along the top, target down the side):
               -   A   T   T
           -   0  -1  -2  -3
           T  -1  -1   0  -1
           T  -2  -2   0  +1
           C  -3  -3  -1   0
        Alignment after backtracking:
           A T T -
           - T T C
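
The matrix filling and backtracking of this slide can be reproduced with a short script. This is a minimal sketch using the slide's scoring scheme (+1 match, -1 mismatch, -1 per gap position, so a linear rather than an affine gap penalty); the function name and the tie-breaking order during backtracking (diagonal first) are choices made here.

```python
def needleman_wunsch(source, target, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming with a linear gap penalty."""
    rows, cols = len(target) + 1, len(source) + 1
    # S[i][j] = best score for aligning target[:i] with source[:j]
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        S[i][0] = i * gap
    for j in range(1, cols):
        S[0][j] = j * gap

    def s(x, y):
        return match if x == y else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            S[i][j] = max(
                S[i - 1][j - 1] + s(target[i - 1], source[j - 1]),  # residue pair
                S[i - 1][j] + gap,  # target residue against a gap
                S[i][j - 1] + gap,  # source residue against a gap
            )

    # Backtrack from the bottom-right cell to the top-left cell.
    top, bottom = [], []
    i, j = len(target), len(source)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + s(target[i - 1], source[j - 1]):
            top.append(source[j - 1])
            bottom.append(target[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append("-")
            bottom.append(target[i - 1])
            i -= 1
        else:
            top.append(source[j - 1])
            bottom.append("-")
            j -= 1
    return S, "".join(reversed(top)), "".join(reversed(bottom))


matrix, aligned_source, aligned_target = needleman_wunsch("ATT", "TTC")
for row in matrix:
    print(row)
print(aligned_source)
print(aligned_target)
```

The bottom-right cell holds the score of the best global alignment; several alignments can share that score, and the backtracking order decides which one is reported.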
    25. 25. Two approaches to align pairwise: align globally versus locally. Global alignment: the Needleman-Wunsch algorithm considers similarity across the full extent of the sequences A and B. Local alignment: the Smith-Waterman algorithm focuses on regions of similarity in parts of the sequences A and B.
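
The difference between the two approaches is small in code: a local alignment never lets a cell drop below zero and reports the best cell anywhere in the matrix. The sketch below computes only the local score, with an assumed +1/-1/-1 scoring scheme and a function name chosen here.

```python
def smith_waterman_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Score of the best local alignment (linear gap penalty), row by row."""
    best = 0
    previous = [0] * (len(seq_b) + 1)
    for i in range(1, len(seq_a) + 1):
        current = [0]
        for j in range(1, len(seq_b) + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            cell = max(
                0,                      # start a new local alignment here
                previous[j - 1] + sub,  # extend with a residue pair
                previous[j] + gap,      # gap in seq_b
                current[j - 1] + gap,   # gap in seq_a
            )
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


local_score = smith_waterman_score("ATT", "TTC")
print("best local score:", local_score)
```

For ATT and TTC the best local alignment is the shared TT, which scores 2, whereas a global alignment of the same sequences under the same scheme has to pay for the unmatched ends.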
    26. 26. Software for creating an optimal pairwise alignment. Best global alignment (Needleman-Wunsch): EMBOSS needle (web interface here, here on Mobyle); EMBOSS stretcher (with Myers-Miller optimization, for very long sequences; web interface on Mobyle). Best local alignment (Smith-Waterman): EMBOSS water; SIM (Huang and Miller, with optimization for very long sequences, can also find non-overlapping suboptimal alignments) (link); EMBOSS matcher (same as SIM); modified version of SIM (by Laurent Duret) with output for graphical viewer. http://mobyle.pasteur.fr/cgi-bin/portal.py#welcome
    27. 27. Parameters that are set
    28. 28. A graphical method: dot plots can be made to rapidly identify regions with similar sequence. The parameters of a dotplot (which uses the identity matrix) are the word size (e.g. per 3 residues) and the threshold (% of the residues in a word that are identities). This is very convenient for large molecules, e.g. chromosomes.
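
A dot plot with a word size and a threshold can be sketched as follows. This is a naive all-against-all comparison for illustration, not how the EMBOSS programs or Gepard are implemented; the function name and the text rendering are assumptions made here.

```python
def dotplot(seq_x, seq_y, word_size=3, threshold=1.0):
    """Return the set of (i, j) where the words starting at seq_x[i] and
    seq_y[j] have a fraction of identical residues >= threshold."""
    dots = set()
    for i in range(len(seq_x) - word_size + 1):
        for j in range(len(seq_y) - word_size + 1):
            identities = sum(
                seq_x[i + k] == seq_y[j + k] for k in range(word_size)
            )
            if identities / word_size >= threshold:
                dots.add((i, j))
    return dots


first = "KLRMWILVATAEIDD"
last = "KPRMCILVAIADIRD"
dots = dotplot(first, last, word_size=3, threshold=1.0)

# Plain-text rendering: seq_x along the top, seq_y down the side.
for j in range(len(last) - 2):
    print("".join("*" if (i, j) in dots else "." for i in range(len(first) - 2)))
```

Lowering the threshold (for example to 0.66 with word size 3) adds dots for words that match in 2 of 3 positions, which makes diverged regions visible at the cost of more background noise.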
    29. 29. Software for making dotplots. EMBOSS contains the following programs: dottup (word comparison); dotmatcher (window/threshold comparison); polydot (word comparison, makes n*n dotplots in one graph). Dotter, developed by Erik Sonnhammer and Richard Durbin (U. Stockholm, Sweden). Dotlet (Java applet) at the Swiss Institute of Bioinformatics. Gepard (Munich Information Center for Protein Sequences, Germany), with a heuristic for speeding up computation, for comparing very long sequences. ... http://www.bits.vib.be/wiki/index.php/Dotplot
    30. 30. Dot plots generate typical patterns which can be interpreted. [Figure: dot plots of sequence A against sequence B showing a simple repeat, an insertion in sequence B, an insertion in sequence A, a complex repeat and a palindrome] http://www.code10.info/index.php?option=com_content&view=article&id=64:inroduction-to-dot-plot
    31. 31. Multiple sequence alignment. How to search for sequence similarity: One to many: in sequence databases (BLAST or FASTA). One to one: to compare two sequences in detail (pairwise sequence comparison). Many to many: to compare multiple sequences (multiple sequence alignment). 3 methods exist: graphical, optimal/exhaustive, heuristic.
    32. 32. Multiple sequence alignment is not simply expanding pairwise sequence alignments. Many to many: one could try dynamic programming, but it is too time consuming: 20 sequences already need more time than the universe has existed... Heuristic methods lead the way; "progressive alignment" is the most used: 1. Use pairwise dynamic programming for all sequence pairs. 2. A guide tree is constructed based on the scores. 3. Two sequences are aligned, and every other sequence is added sequentially following the guide tree (progressive clustering).
    33. 33. Progressive clustering is a two-step process: measuring distance and constructing alignment.
        STEP 1: measure similarity with N(N-1)/2 pairwise sequence alignments, giving a similarity matrix:
                 A    B    C
            B  142
            C   95  101
            D   60   62   55
        Progressive clustering of the similarity matrix gives the guide tree: join A and B, take the mean of AB, join C, take the mean of ABC, join D.
        STEP 2: construct the MSA by progressive alignment along the guide tree ("once a gap, always a gap").
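
The guide tree construction can be sketched with the similarity scores from this slide (A-B 142, A-C 95, B-C 101, A-D 60, B-D 62, C-D 55). The sketch assumes average linkage, meaning the similarity between two clusters is the mean over all their leaf pairs; that is one reading of "take mean" on the slide, and real programs may use a different clustering method. All names are chosen here for illustration.

```python
from itertools import combinations

SIMILARITY = {
    frozenset("AB"): 142, frozenset("AC"): 95, frozenset("BC"): 101,
    frozenset("AD"): 60, frozenset("BD"): 62, frozenset("CD"): 55,
}


def mean_similarity(leaves_x, leaves_y, similarity):
    """Average pairwise similarity between the leaves of two clusters."""
    scores = [similarity[frozenset((x, y))] for x in leaves_x for y in leaves_y]
    return sum(scores) / len(scores)


def build_guide_tree(names, similarity):
    """Repeatedly join the two most similar clusters into a nested tuple."""
    # Each cluster is a pair: (subtree, tuple of the leaves it contains).
    clusters = [(name, (name,)) for name in names]
    while len(clusters) > 1:
        x, y = max(
            combinations(clusters, 2),
            key=lambda pair: mean_similarity(pair[0][1], pair[1][1], similarity),
        )
        clusters.remove(x)
        clusters.remove(y)
        clusters.append(((x[0], y[0]), x[1] + y[1]))
    return clusters[0][0]


guide_tree = build_guide_tree("ABCD", SIMILARITY)
print(guide_tree)
```

The nested tuple gives the order in which sequences enter the progressive alignment: A and B are aligned first, then C is added, then D.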
    34. 34. Progressive clustering: once a gap, always a gap
    35. 35. The guide tree is NOT a phylogenetic tree !
    36. 36. The progressive alignment framework can be extended to make it faster and more sensitive. More sensitive: consistency, a per-position scoring scheme (T-Coffee); structural guidance, the alignment is guided by structural information (Expresso). Faster: distance measured by analysing k-tuples (see later) instead of pairwise aligning (Clustal Omega). http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.0030123
    37. 37. Different formats of aligned sequences exist.

        Clustal format

        h1_pea     -MATEEPIVAVETVPEPIVTEPTTITEPEVPEKEEPKAEVEKTKKAKGSKPKKASKPRNP
        h1_sollc   -MATEEPVIVNEVVEEQAA--PETVKDEANPPAKSGKAKKETKAKKPAAPRKRSATP---
        h11_volca  MSETEAAPVVAPAAEAAPAAEAPKAKAPKAKAPKQPKAPKAPKEPKAPKEKKPKAAP---

        h1_pea     ASHPTYEEMIKDAIVSLKEKNGSSQYAIAKFIEEKQ-KQLP-ANFKKLLLQNLKKNVASG
        h1_sollc   -THPPYFEMIKDAIVTLKERTGSSQHAITKFIEEKQ-KSLP-SNFKKLLLTQLKKFVASE
        h11_volca  -THPPYIEMVKDAITTLKERNGSSLPALKKFIENKYGKDIHDKNFAKTLSQVVKTFVKGG

        Phylip format

        3 298
        h1_pea     -MATEEPIVA VETVPEPIVT EPTTITEPEV PEKEEPKAEV EKTKKAKGSK
        h1_sollc   -MATEEPVIV NEVVEEQAA- -PETVKDEAN PPAKSGKAKK ETKAKKPAAP
        h11_volca  MSETEAAPVV APAAEAAPAA EAPKAKAPKA KAPKQPKAPK APKEPKAPKE

                   PKKASKPRNP ASHPTYEEMI KDAIVSLKEK NGSSQYAIAK FIEEKQ-KQL
                   RKRSATP--- -THPPYFEMI KDAIVTLKER TGSSQHAITK FIEEKQ-KSL
                   KKPKAAP--- -THPPYIEMV KDAITTLKER NGSSLPALKK FIENKYGKDI

        http://www.bits.vib.be/wiki/index.php/Multiple_sequence_alignment#Formats
    38. 38. Software that implements these algorithms and lets you manually adjust the alignments. Alignment editors: SeaView, SeqPup, GeneDoc, Jalview, BioEdit, CLC Sequence Viewer, UGene. http://www.bits.vib.be/wiki/index.php/Multiple_sequence_alignment
    39. 39. Additional references. Notredame et al. (2000), “T-Coffee: A novel method for fast and accurate multiple sequence alignment,” J Mol Biol 302:205. [Introduced notion of consistency] Blackshields et al. (2010), “Sequence embedding for fast construction of guide trees for multiple sequence alignment,” Algorithms for Mol Biol 5:21. [mBed algorithm] Söding (2005), “Protein homology detection by HMM-HMM comparison,” Bioinformatics 21:951. [HHalign algorithm] Thompson et al. (2005), “BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark,” Proteins 61:127.
    40. 40. Divided in one to one, one to many, or many to many sequence comparisons. How to search for sequence similarity: One to many: in sequence databases (BLAST or FASTA). One to one: to compare two sequences in detail (pairwise sequence comparison). Many to many: to compare multiple sequences (multiple sequence alignment). 3 methods exist: graphical, optimal/exhaustive, heuristic.
    41. 41. Searching sequence databases is done through a little trick. Problem: find all sequences in a database that are similar to a query sequence (or: find the position of many short reads in a genome). Bottleneck: we cannot compute an optimal alignment for every sequence and determine which is best (~MSA). This is time-consuming and only practicable on special computers (parallel computer or computer cluster). "Heuristic" algorithm: gain of speed at the expense of some loss in sensitivity. • BLAST (developed by S. Altschul et al. at NCBI) • FastA (developed by W. R. Pearson at U. of Virginia)
    42. 42. BLAST quickly finds similar sequences by giving up some sensitivity. Algorithm (= steps to follow to reach your goal) http://www.ncbi.nlm.nih.gov/books/NBK21097/
    43. 43. BLAST step 1: neighbouring. BLAST step 2: searching the little words in the db.
    44. 44. BLAST step 3: extend where the words match. Only if words match >Sg score. Proteins: extension only if there is another hit within a distance of <40. Proteins: optimal composition adapted.
    45. 45. Each BLAST search hit has an E-value, which is how many hits we expect by chance. Expect value: number of unrelated databank sequences expected to yield the same or a higher score S by pure chance (extreme value distribution). $E = m \cdot n \cdot K \cdot e^{-\lambda S}$, with m the query sequence length, n the total databank length, and K and λ parameters obtained by simulation (search random sequence against random databank). http://bioinfo.uncc.edu/zhx/binf8312/lecture-7-SequenceAnalyses.pdf
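
The E-value formula is easy to evaluate numerically. In the sketch below, the values of K and λ, the query length and the databank length are illustrative assumptions; the real K and λ depend on the scoring matrix and gap penalties used by the search.

```python
import math


def expect_value(score, query_length, databank_length, K, lam):
    """E = m * n * K * exp(-lambda * S)."""
    return query_length * databank_length * K * math.exp(-lam * score)


# Illustrative parameter values only (not taken from the slides).
K, LAM = 0.041, 0.267
m, n = 250, 50_000_000

for raw_score in (50, 100, 150):
    print(raw_score, expect_value(raw_score, m, n, K, LAM))
```

Two properties of the formula are worth noticing: E drops exponentially with the score S, and it grows linearly with the databank length n, so the same alignment gets a worse E-value in a larger database.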
    46. 46. BLAST statistics. Probability P that the databank yields by pure chance at least one alignment with the same or a higher score: $P = 1 - e^{-E}$. Bit score: score corrected for the scale of the scoring scheme: $S' = \frac{\lambda S - \ln K}{\ln 2}$.
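
Both quantities follow directly from the formulas on this slide. The sketch below uses function names chosen here and illustrative values for K, λ and the search space; none of these numbers come from the slides.

```python
import math


def p_value(e_value):
    """P = 1 - exp(-E): probability of at least one chance hit this good."""
    return -math.expm1(-e_value)  # numerically stable form of 1 - exp(-E)


def bit_score(raw_score, K, lam):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


# Illustrative values only.
K, LAM = 0.041, 0.267
m, n = 250, 50_000_000
raw = 100

bits = bit_score(raw, K, LAM)
e_from_bits = m * n * 2 ** (-bits)          # E expressed with the bit score
e_from_raw = m * n * K * math.exp(-LAM * raw)  # E from the raw score
print(bits, e_from_bits, e_from_raw, p_value(0.1))
```

Substituting S' into the E-value formula gives E = m * n * 2^(-S'), which no longer contains K or λ; that is why bit scores can be compared between searches that use different scoring schemes. For small E, P is almost equal to E.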
    47. 47. Interpreting the BLAST results by E-value and bit score. E-value: the lower the better (= chance to obtain such a similarity by chance with a random sequence and database of the same size) (e.g. 0.1 means that in 1 in 10 searches, this similarity could have arisen by chance alone). Max/Total score: bit score, the higher the better (= score constructed from the length of the total alignment of the high scoring pair).
    48. 48. Depending on DNA and/or protein sequences as query or in the db, you choose a BLAST version. Different flavours of BLAST, depending on query sequence (DNA or protein) and database (DNA or protein):
        Flavour    Query            Database
        blastn     DNA              DNA
        blastp     protein          protein
        blastx     translated DNA   protein
        tblastn    protein          translated DNA
        tblastx    translated DNA   translated DNA
    49. 49. You can adjust a few parameters of the BLAST algorithm. E-value threshold for searching, rule of thumb: below 1e-05: good; between 1e-05 and 1e-01: weak similarity; between 1e-01 and 10: take a good look. Larger word size = faster search but lower sensitivity. SEG filter for proteins. DUST filter for nucleic acids.
    50. 50. A lot of power lies within choosing the right database for the BLAST search. The choice of the database: the "nr/nt" database is the largest nucleotide database available through NCBI BLAST; select the "nr/nt" database for this exercise. It includes all GenBank, RefSeq Nucleotides, EMBL (European nucleotide database), DDBJ (Japanese nucleotide database) and PDB (Protein Data Bank) sequences, but no EST, STS, GSS, or phase 0, 1 or 2 htgs (unfinished high throughput genomic) sequences. The NCBI nr database originally got its name from the phrase "nonredundant" nucleotide database, but there is no longer any claim to nonredundancy in the sequence set.
    51. 51. Nearly every sequence database comes with BLAST services nowadays. Numerous online websites, mostly WU-BLAST (NCBI) http://blast.ncbi.nlm.nih.gov/Blast.cgi http://www.ebi.ac.uk/Tools/sss/ But it is very easy to install on your own computer (run locally): 1. Download the BLAST programs (here). 2. Format your database (multifasta file). 3. Run BLAST. You can also use NCBI BLAST online outside of the browser by using netblast (instructions here) http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/
    52. 52. Some adjustments to the BLAST protocol exist for particular purposes. Identifying very distantly related proteins: PSI-BLAST (position specific iterated) (see module 3). BLAST protein with matching of a pattern: PHI-BLAST (pattern hit initiated) (see module 3). BLAST highly similar nucleotide sequences: Mega-BLAST. LastZ explanation: have a look at the dotplots here.
    53. 53. BLAST2SEQ aligns 2 sequences and visualises the output in a dotplot-like graph. The tool to do this is called BLAST2SEQ, e.g. comparing chrI with chrVIII of S. cerevisiae: insertions! http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch&PROG_DE
    54. 54. BLAT is derived from BLAST, and is used for searching very similar sequences in a genome. BLAT = BLAST-like alignment tool. The database is a genome sequence. The database is not read from files, but kept in memory as words of size 11. It can only be used for very similar sequences, e.g. if you have a fragment whose position in the genome you want to know.
    55. 55. The technique of indexing words is also used in some short read aligners. In BLAST for nucleotides: k = 11 (11-mers): 11111111111 → 11 consecutive matches. However, non-consecutive matches improve sensitivity: a spaced seed. 111010010100110111 → 55% more sensitive. • 1 means a match, 0 means a don't care position – Key size: number of 1s – Key width: total number of 0s and 1s • The keys are used to index the genome or the reads, depending on the aligner. doi: 10.1093/bib/bbq015 on http://dx.doi.org
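
Indexing with a spaced seed can be sketched as follows. The seed pattern is the one shown on this slide; the genome string, the function names and the way the read is mutated are made up for the illustration, and real aligners use far more compact data structures.

```python
SPACED_SEED = "111010010100110111"  # key size 11 (number of 1s), key width 18


def seed_key(sequence, start, seed=SPACED_SEED):
    """Keep only the bases under the 1s of the seed; 0s are don't care positions."""
    window = sequence[start:start + len(seed)]
    return "".join(base for base, flag in zip(window, seed) if flag == "1")


def build_index(sequence, seed=SPACED_SEED):
    """Map every key to the list of start positions where it occurs."""
    index = {}
    for start in range(len(sequence) - len(seed) + 1):
        index.setdefault(seed_key(sequence, start, seed), []).append(start)
    return index


# Hypothetical genome and a read taken from position 5 with one mismatch
# placed on a don't care position (index 3 of the seed is a 0).
genome = "TTGACCATGCAAGTCGATCGGATACCTAGGCA"
original = genome[5:5 + len(SPACED_SEED)]
mutated_base = "A" if original[3] != "A" else "C"
read = original[:3] + mutated_base + original[4:]

index = build_index(genome)
hits = index.get(seed_key(read, 0), [])
print(hits)
```

The read differs from the genome, yet its key is unchanged because the mismatch falls on a don't care position, so the lookup still returns the true position.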
    56. 56. FastA is another popular sequence database search algorithm. Find runs of identities. Rescore using a PAM matrix and keep the top scoring segments.
    57. 57. FastA is another popular sequence database search algorithm. Apply a joining threshold to eliminate segments that are unlikely to be part of the alignment that includes the highest scoring segment. Use dynamic programming to optimize the alignment in a narrow band that encompasses the top scoring segments.
    58. 58. FastA is accessible on the website of EBI. Further explanation of the algorithm: here. Accessibility: EBI (help link). FastA developers: link.
    59. 59. The interpretation of FastA output is similar to that of BLAST. http://www.ebi.ac.uk/Tools/sss/fasta/
    60. 60. Similarity you observe, homology you infer. Interpreting results: Sequences are similar if their similarity score is significantly higher than that of random sequences of the same length and composition. Sequences are homologous if they are similar because they diverged from a common ancestor. Sequences are analogous if they are similar because of convergent evolution (e.g. binding sites for the same ligand). Similarity you observe, homology you infer! You can speak of %similarity or %identity, not of %homology!
    61. 61. Homology: orthologous and paralogous (in- and out-). [Figure labels: (in), (out)]
    62. 62. Summary sequence similarity. Pairwise (one to one): dotplot (graphical); Smith-Waterman / Needleman-Wunsch (optimal). Multiple sequence alignment (many to many) (heuristic): ClustalW, Muscle, ... Database search (one to many) (heuristic): BLAST, FastA, BLAT.
    63. 63. What can you check to stay updated? Biocatalogue http://www.biocatalogue.org/ EMBRACE http://www.embraceregistry.net/ Bioinformatics Links Directory http://www.bioinformatics.ca/links_directory/
    64. 64. Summary
        • Detecting sequence similarity is a cornerstone type of analysis in bioinformatics
        • The identity matrix summarizes this scoring system, listing all residue combinations in a table
        • Substitution or score matrices provide a means to determine similarity in an objective way
        • Complex substitution matrices are more meaningful and more sensitive for detecting similarity
        • Substitution matrices are derived from analysis of multiple alignments of related sequences
        • The substitution matrices capture the similarity in properties between residues
        • A matrix does not capture insertions and deletions: penalties are given to deal with them
        • The two parts of the gap penalty: a higher penalty for creation, a lower one for its extension
        • Dynamic programming uses a gap penalty and a scoring scheme to align two sequences
        • Needleman-Wunsch to align two sequences over the whole length (global alignment)
        • Smith-Waterman to align the most similar parts of two sequences (local alignment)
        • A graphical method: dot plots can be made to rapidly identify regions with similar sequence
        • Dot plots generate typical patterns which can be interpreted
        • Multiple sequence alignment is not simply expanding pairwise sequence alignments
        • Searching sequence databases is done through a little trick
        • BLAST quickly finds similar sequences by giving up some sensitivity
        • Depending on DNA and/or protein sequences as query or in the db, you choose a BLAST version
        • You can adjust a few parameters of the BLAST algorithm
        • A lot of power lies within choosing the right database for the BLAST search
        • Some adjustments to the BLAST protocol exist for particular purposes
        • BLAST2SEQ aligns 2 sequences and visualises the output in a dotplot-like graph
        • BLAT is derived from BLAST, and is used for searching very similar sequences in a genome
        • The technique of indexing words is also used in some short read aligners
        • FastA is another popular sequence database search algorithm
        • FastA is accessible on the website of EBI
        • The interpretation of FastA output is similar to that of BLAST
        • Similarity you observe, homology you infer
        • Homology: orthologous and paralogous (in- and out-)
