Statistical significance of alignments

Published in: Education
  • Image credit (Williams sisters): http://media-2.web.britannica.com/eb-media/24/79824-004-7C20393C.jpg Image credit (Marilyn Monroe): http://cm1.theinsider.com/media/0/81/12/Marilyn-Monroe-11.0.0.0x0.432x594.jpeg
  • Note: made image by aligning Uniprot PTCH2_HUMAN to C. elegans TRA2 protein (CE23546 from WormBase) using Ssearch (Smith-Waterman algorithm) and viewing the alignment using Jalview. This alignment contains the Patched domain. Image credit (human): http://www.ensembl.org/img/species/pic_Homo_sapiens.png Image credit (C. elegans): http://www.ensembl.org/img/species/pic_Caenorhabditis_elegans.png
  • Note: made image by aligning Uniprot PTCH2_HUMAN to C. elegans TRA2 protein (CE23546 from WormBase) using Ssearch (Smith-Waterman algorithm) and viewing the alignment using Jalview. Image credit (human): http://www.ensembl.org/img/species/pic_Homo_sapiens.png Image credit (C. elegans): http://www.ensembl.org/img/species/pic_Caenorhabditis_elegans.png
  • Statistical significance of alignments

    1. Statistical Significance of Alignments
       Dr Avril Coghlan, alc@sanger.ac.uk
       Note: this talk contains animations which can only be seen by downloading the slides and using ‘View Slide show’ in PowerPoint.
    2. Biological importance of alignments
       • A sequence alignment represents a hypothesis about the homology of individual positions in different sequences. Example hypothesis from the figure: Y in sequences 1 and 2 is homologous to E in sequences 3 and 4.
       • Based on an alignment, we quantify similarity.
       • Sequence similarity suggests a shared evolutionary history. Furthermore, proteins with very similar sequences probably have similar biological functions.
    3. • Once we have an alignment between 2 sequences, we can calculate their similarity over their lengths.
       • A measure of similarity is percent identity, i.e. number of identical amino acids * 100 / length of the alignment.
       • E.g. the alignment in the figure is 39 amino acids long, and the human and fruitfly sequences differ at 1 position, so they have a percent identity of 38 * 100 / 39 ≈ 97% in this part of the Eyeless PAX domain.
       [Figure: alignment of positions 1 to 39 of the Eyeless PAX domain in human, mouse, cat, sea squirt and fruitfly; the single position where the human and fruitfly Eyeless proteins differ is marked.]
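
The percent identity formula on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `percent_identity` is made up here, and the choice that a gap character (`-`) never counts as identical is an assumption rather than something the slides specify. The toy sequences are the 10-residue examples used on the "Similarity versus homology" slides.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity = identical positions * 100 / alignment length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # Count columns where both rows carry the same residue.
    # Assumption: a gap ('-') aligned to a gap is not counted as identical.
    identical = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return identical * 100 / len(aligned_a)


# 10 columns, 1 difference -> 90% identical
print(percent_identity("VIVALASVEG", "VIVAVASVEG"))
# The Eyeless example: 38 identical positions out of 39
print(round(38 * 100 / 39, 1))
```

The second line reproduces the arithmetic for the Eyeless PAX domain alignment without needing the actual sequences, which are not given in the transcript.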
    4. Similarity versus homology
       • Homologues are similar because they had a common ancestor, e.g. eyeless homologues.
       • After aligning two sequences, we can say they are 99% similar, or 50% similar, etc. Very similar sequences are probably homologues:
           V I V A L A S V E G
           V I V A V A S V E G   (90% similar)
       • Any 2 random sequences are similar to some extent, so similarity doesn’t necessarily imply homology. Sequences with very low similarity may be homologues:
           V I V A L A S V E G
           T S Y A V F G R T W   (10% similar)
    5. Similarity versus homology
       • Two girls are either sisters or not.
       • Two sequences are either homologues or not.
           V I V A L A S V E G
           V I V A V A S V E G   90% similar (“90% homologous” is incorrect!)
           V I V A L A S V E G
           T S Y A V F G R T W   10% similar (“10% homologous” is incorrect!)
    6. A key question is:
       • How does one interpret minimal similarity? Are the sequences actually related, or is the alignment by chance?
           Q K G S Y Q E K G Y C
           |     |             |
           Q Q E S G P V R S T C
    7. Statistical analysis of alignments
       • We’ve calculated the score for the best alignment between 2 sequences A and B, but is it due to chance or biology?
       • Sequences accumulate substitutions over millions of years, so it is sometimes hard to decide if 2 sequences are homologous.
       • Unrelated sequences may be somewhat similar due to chance.
    8. • In humans, mutations in the PTCH2 gene are a cause of brain tumours and skin cancers.
       • In the nematode Caenorhabditis elegans, the tra-2 gene functions in development to determine the sex of the embryo. C. elegans adults can be male (make sperm) or hermaphrodite (make sperm & eggs).
       [Figure: alignment of human PTCH2 & Caenorhabditis elegans TRA2 (score = 136).]
       • Are human PTCH2 and C. elegans tra-2 homologues?
    9. Statistical significance of the alignment
       • To decide if two sequences are likely to be homologues (related), we calculate the statistical significance of the alignment score.
       • To do this, we first need a null model (background model), i.e. a statistical model that will let us calculate what we expect by chance.
       • There are many proteins in all the different species, and 2 randomly chosen proteins are expected to be unrelated. Our null model should therefore describe the alignment scores expected for pairs of unrelated sequences.
    10. • How can we know the alignment scores for pairs of unrelated protein sequences? We could generate random protein sequences, and calculate alignment scores for pairs of random protein sequences.
        • We can use a multinomial model to generate random protein sequences, i.e. make a roulette wheel with different fractions of the wheel labelled for each of the 20 amino acids. Then spin the wheel n times to make a random protein sequence that is n amino acids long.
        • In the multinomial model pictured, p(P)=0.14, p(A)=0.28, p(W)=0.14, p(H)=0.14, p(E)=0.28 (rounded values). All the other amino acids have probabilities of zero here.
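
The roulette-wheel idea can be sketched with Python's standard library. The probabilities below are the ones quoted on the slide; the function name `random_sequence` and the fixed seed are illustrative choices, not part of the original talk. Because the slide's values are rounded (they add up to 0.98), the sketch passes them as relative weights, which `random.choices` accepts without requiring them to sum to 1.

```python
import random

# Wheel from the slide: every other amino acid has probability zero.
WHEEL = {"P": 0.14, "A": 0.28, "W": 0.14, "H": 0.14, "E": 0.28}


def random_sequence(wheel: dict, n: int, rng: random.Random) -> str:
    """Spin the wheel n times to build a random sequence n residues long."""
    letters = list(wheel)
    weights = [wheel[aa] for aa in letters]
    # Each spin is independent of the others, as in a multinomial model.
    return "".join(rng.choices(letters, weights=weights, k=n))


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
seq = random_sequence(WHEEL, 20, rng)
print(seq)
```

Every residue is drawn independently, so the model captures composition but nothing about the order of amino acids in real proteins.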
    11. • A good multinomial model for random sequences should take the sequence composition into account.
        • E.g. we could use a multinomial model to generate random sequences of the same composition as C. elegans TRA2, i.e. make a roulette wheel where the fraction of the circle labelled with each of the 20 amino acids is set equal to the % of that amino acid in the TRA2 sequence.
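
A composition-matched wheel can be estimated directly from a sequence by counting residues. The sketch below assumes nothing about TRA2 itself: `template` is a short made-up stand-in, not the real protein, and the helper names `composition` and `random_like` are hypothetical.

```python
import random
from collections import Counter


def composition(seq: str) -> dict:
    """Fraction of each amino acid in seq (the labelled slices of the wheel)."""
    counts = Counter(seq)
    return {aa: count / len(seq) for aa, count in counts.items()}


def random_like(seq: str, rng: random.Random) -> str:
    """Random sequence with the same length and expected composition as seq."""
    model = composition(seq)
    letters = list(model)
    weights = [model[aa] for aa in letters]
    return "".join(rng.choices(letters, weights=weights, k=len(seq)))


template = "MKLLAVALLGE"  # hypothetical stand-in, NOT the real TRA2 sequence
print(composition(template))
print(random_like(template, random.Random(0)))
```

Note that the random sequence matches the template's composition only in expectation; a single draw will usually differ slightly. Shuffling the template would preserve the composition exactly, but that is a different null model from the multinomial one described on the slide.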
    12. • One way to see if an alignment score is statistically significant is to compare it to the scores for alignments of random sequences.
        • We make a random sequence of the same length and amino acid composition as one of our original 2 sequences (e.g. TRA2), i.e. use our ‘TRA2’ multinomial model to make a sequence.
        [Figure: alignment of human PTCH2 & a random sequence generated using a multinomial model, with the probabilities of amino acids set equal to their fractions in TRA2 (score = 51).]
    13. • We can generate 200 random sequences using our TRA2-like multinomial model.
        • For each random sequence, we can calculate the best alignment score for the random sequence and human PTCH2.
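
The procedure on this slide can be sketched end to end as follows. Several things here are assumptions made to keep the example small: the scoring scheme (match +2, mismatch -1, linear gap penalty -2) is a toy one, whereas a real search such as the Ssearch run mentioned in the slide notes would use an amino acid substitution matrix and its own gap penalties; and the two short sequences are the toy pair from slide 6, standing in for PTCH2 and TRA2.

```python
import random
from collections import Counter


def sw_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score (Smith-Waterman with a linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def null_scores(query: str, template: str, n_random: int, rng: random.Random) -> list:
    """Scores of query against n_random sequences drawn from a template-like model."""
    letters, counts = zip(*Counter(template).items())
    scores = []
    for _ in range(n_random):
        rand = "".join(rng.choices(letters, weights=counts, k=len(template)))
        scores.append(sw_score(query, rand))
    return scores


query = "QKGSYQEKGYC"     # toy sequences from slide 6,
template = "QQESGPVRSTC"  # standing in for PTCH2 and TRA2
scores = null_scores(query, template, 200, random.Random(1))
print(sw_score(query, template), max(scores))
```

The list `scores` is the empirical null distribution: the alignment scores one might expect when the second sequence is unrelated to the query but has a TRA2-like (here, template-like) composition.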
    14. Compare the scores obtained with the score seen for PTCH2 & TRA2.
        [Figure: distribution of alignment scores for alignments of random sequences, with the alignment score for PTCH2 & TRA2 marked; 5% of the scores for random sequences lie at or above it.]
        • What % of the random sequences have a score equal to or higher than that for TRA2 & PTCH2? E.g. 5% (0.05) in the picture.
        • This method can be used to estimate the significance of alignments in the form of P-values, e.g. P=0.05 in the picture.
        • We accept the alignment as significant (indicating probable homology) if the score is in the top 5% (or another chosen value) of the scores for random sequences, i.e. if P ≤ 0.05.
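
The P-value estimate described on this slide is simply the fraction of random-sequence scores that are at least as high as the observed score. A minimal sketch, with made-up scores purely for illustration (they are not results from the talk) and a hypothetical function name:

```python
def empirical_p_value(observed: float, null_scores: list) -> float:
    """Fraction of null scores equal to or higher than the observed score."""
    if not null_scores:
        raise ValueError("need at least one null score")
    hits = sum(1 for s in null_scores if s >= observed)
    return hits / len(null_scores)


# Made-up scores for alignments of random sequences (illustration only).
null = [40, 45, 51, 38, 60, 47, 55, 42, 49, 53]
p = empirical_p_value(58, null)
print(p)          # 1 of the 10 null scores is >= 58, so 0.1
print(p <= 0.05)  # False: not significant at the 5% threshold
```

With N random sequences the smallest non-zero estimate is 1/N, so the number of random sequences limits how small a P-value can be resolved. A common variant adds 1 to both the numerator and the denominator so that the estimate is never exactly zero.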
    15. E.g. for human PTCH2 and C. elegans TRA2:
        • The alignment score is 136.
        • When 200 random sequences (generated with a ‘TRA2’ multinomial model) were aligned to PTCH2, only 0.36% of the alignments had a score of ≥136.
        • Therefore, we estimate a P-value of P=0.0036, i.e. we estimate that the probability of getting a score of 136 or higher for PTCH2 and TRA2 due to chance is 0.0036 (36/10,000).
        [Figure: alignment of human PTCH2 & C. elegans TRA2 (score = 136).]
        • Human PTCH2 and C. elegans tra-2 are probably homologues.
    16. • In the example in the figure, 0.95 of the random sequences have an alignment score equal to or higher than that for A & B, so P=0.95.
        [Figure: distribution of alignment scores for alignments of random sequences, with the alignment score for A & B marked; 95% of the scores for random sequences lie at or above it.]
        [Figure: alignment of fruitfly Eyeless & C. elegans TRA2 (score = 78).]
        • P=1, so eyeless and tra-2 are probably not homologues.
    17. Further Reading
        • Chapter 3 in Introduction to Computational Genomics, Cristianini & Hahn
        • Chapter 6 in Deonier et al. book Computational Genome Analysis
        • Practical on alignment in R in the Little Book of R for Bioinformatics: https://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html
