Note: made image by aligning Uniprot PTCH2_HUMAN to C. elegans TRA2 protein (CE23546 from WormBase) using Ssearch (Smith-Waterman algorithm) and viewing the alignment using Jalview. This alignment contains the Patched domain. Image credit (human): http://www.ensembl.org/img/species/pic_Homo_sapiens.png Image credit (C. elegans): http://www.ensembl.org/img/species/pic_Caenorhabditis_elegans.png
Statistical significance of alignments
Dr Avril Coghlan firstname.lastname@example.org
Note: this talk contains animations which can only be seen by downloading and using ‘View Slide show’ in PowerPoint
Biological importance of alignments
• A sequence alignment represents a hypothesis about the homology of individual positions in different sequences
  Hypothesis: Y in seqs 1,2 is homologous to E in seqs 3,4
• Based on an alignment, we quantify similarity
• Sequence similarity suggests a shared evolutionary history
  Furthermore, proteins with very similar sequences probably have similar biological functions
• Once we have an alignment between 2 sequences, we can calculate their similarity over their lengths
  A measure of similarity is percent identity, ie. number of identical amino acids * 100 / length of the alignment
  eg. the alignment below is 39 amino acids long, & the human & fruitfly sequences differ at 1 position
  → Human & fruitfly sequences have a percent identity of (38*100/39 =) 97% in this part of the Eyeless PAX domain
[Figure: 39-column alignment of the Eyeless PAX domain from human, mouse, cat, sea squirt and fruitfly; the human and fruitfly Eyeless proteins differ at one position]
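The percent identity formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original talk: the function name `percent_identity` is hypothetical, and it assumes the two sequences are already aligned (equal length, with `-` for gaps).

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity = identical positions * 100 / alignment length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # Count columns where both sequences carry the same residue.
    # A gap character ("-") is never counted as an identity.
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return identical * 100 / len(aligned_a)


# Toy 10-column alignment: the two sequences differ at 1 position.
print(percent_identity("VIVALASVEG", "VIVAVASVEG"))  # 90.0

# The Eyeless example: 38 identical positions in a 39-column alignment.
print(round(38 * 100 / 39))  # 97
```

Note that the denominator here is the alignment length, as in the formula above; other conventions (eg. dividing by the length of the shorter sequence) give different values for gapped alignments.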
Similarity versus homology
• Homologues are similar because they had a common ancestor, eg. eyeless homologues
• After aligning two sequences, we can say they are 99% similar, or 50% similar, etc.
  Very similar sequences are probably homologues:
    V I V A L A S V E G
    V I V A V A S V E G   (90% similar)
• Any 2 random sequences are similar to some extent, so similarity doesn’t necessarily imply homology
  Sequences with very low similarity may be homologues:
    V I V A L A S V E G
    T S Y A V F G R T W   (10% similar)
Similarity versus homology
• Two girls are either sisters or not
• Two sequences are either homologues or not
    V I V A L A S V E G
    V I V A V A S V E G   90% similar; “90% homologous” is incorrect!
    V I V A L A S V E G
    T S Y A V F G R T W   10% similar; “10% homologous” is incorrect!
A key question is:
• How does one interpret minimal similarity? Are the sequences actually related, or is the alignment by chance?
    Q K G S Y Q E K G Y C
    |     |             |
    Q Q E S G P V R S T C
Statistical analysis of alignments
• We’ve calculated the score for the best alignment between 2 sequences A and B, but is it due to chance or biology?
• Sequences accumulate substitutions over millions of years, so it is sometimes hard to decide if 2 sequences are homologous
• Unrelated sequences may be somewhat similar due to chance
In humans, mutations in the PTCH2 gene are a cause of brain tumours and skin cancers
In the nematode Caenorhabditis elegans, the tra-2 gene functions in development to determine the sex of the embryo
C. elegans adults can be male (make sperm) or hermaphrodite (make sperm & eggs)
Alignment of human PTCH2 & Caenorhabditis elegans TRA2 (score = 136):
Are human PTCH2 and C. elegans tra-2 homologues?
Statistical significance of the alignment
• To decide if two sequences are likely to be homologues (related), we calculate the statistical significance of the alignment score
• To do this, we first need a null model (background model), ie. a statistical model that will let us calculate what we expect by chance
  There are many proteins in all the different species
  2 randomly chosen proteins are expected to be unrelated
  Our null model should therefore describe the alignment scores expected for pairs of unrelated sequences
• How can we know the alignment scores for pairs of unrelated protein sequences?
  We could generate random protein sequences, & calculate alignment scores for pairs of random protein sequences
  We can use a multinomial model to generate random protein sequences
  ie. make a roulette wheel with different fractions of the wheel labelled for each of the 20 amino acids
  Then spin the wheel n times to make a random protein sequence that is n amino acids long
  In this multinomial model, p(P)=0.14, p(A)=0.28, p(W)=0.14, p(H)=0.14, p(E)=0.28
  All the other amino acids have probabilities of zero here
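The roulette-wheel idea can be sketched with Python's standard `random.choices`, which draws items in proportion to their weights. This is an illustration only: the helper name `random_sequence` and the fixed seed are assumptions, and the probabilities are the rounded values from the slide.

```python
import random

# Probabilities from the slide. They are rounded (they sum to 0.98),
# which is harmless here: random.choices treats them as relative weights.
MODEL = {"P": 0.14, "A": 0.28, "W": 0.14, "H": 0.14, "E": 0.28}


def random_sequence(model: dict[str, float], n: int, rng: random.Random) -> str:
    """Spin the 'roulette wheel' n times and join the outcomes."""
    letters = list(model)
    weights = [model[aa] for aa in letters]
    return "".join(rng.choices(letters, weights=weights, k=n))


rng = random.Random(0)  # fixed seed (an assumption) so the sketch is repeatable
seq = random_sequence(MODEL, 20, rng)
print(seq)  # a 20-letter sequence containing only P, A, W, H and E
```

Amino acids that are absent from the model (probability zero) can never be drawn, matching the statement above.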
• A good multinomial model for random sequences should take into account the sequence composition
  eg. we could use a multinomial model to generate random sequences of the same composition as C. elegans TRA2
  ie. make a roulette wheel where the fraction of the circle labelled with each of the 20 amino acids is set equal to the % of that amino acid in the TRA2 sequence
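Setting the wheel fractions from a sequence's composition might look like the sketch below. The function name `composition_model` is hypothetical, and the input is a short made-up peptide standing in for TRA2 (the real sequence is not reproduced here).

```python
from collections import Counter


def composition_model(sequence: str) -> dict[str, float]:
    """Return the fraction of each amino acid observed in the sequence."""
    if not sequence:
        raise ValueError("sequence is empty")
    counts = Counter(sequence)
    total = len(sequence)
    return {aa: n / total for aa, n in counts.items()}


toy = "MAEELLKKAA"  # hypothetical stand-in, NOT the TRA2 sequence
model = composition_model(toy)
for aa, p in sorted(model.items()):
    print(aa, p)
```

The resulting dictionary can be used directly as the weights of a multinomial model: amino acids that are frequent in the template become frequent in the random sequences.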
• One way to see if an alignment score is statistically significant is to compare it to the scores for alignments of random sequences
  We make a random sequence of the same length and amino acid composition as one of our original 2 sequences (eg. TRA2)
  ie. use our ‘TRA2’ multinomial model to make a sequence
Alignment of human PTCH2 & a random sequence generated using a multinomial model (with the probabilities of amino acids set equal to their fractions in TRA2) (score = 51):
• We can generate 200 random sequences using our TRA2-like multinomial model
  For each random sequence, we can calculate the best alignment score for the random sequence and human PTCH2
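The procedure above can be sketched end to end as follows. Several things here are assumptions made for illustration: the scoring scheme (match +2, mismatch -1, linear gap penalty -2) is a simplification, whereas a tool such as Ssearch would normally use an amino acid substitution matrix and affine gap penalties; the two short sequences are made-up stand-ins for PTCH2 and TRA2; and the names `sw_score` and `random_scores` are hypothetical.

```python
import random
from collections import Counter


def sw_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def random_scores(query: str, template: str, n_random: int = 200, seed: int = 0) -> list[int]:
    """Align `query` to n_random random sequences that have the same
    length and amino acid composition model as `template`."""
    rng = random.Random(seed)
    counts = Counter(template)
    letters = list(counts)
    weights = [counts[aa] for aa in letters]
    scores = []
    for _ in range(n_random):
        rand_seq = "".join(rng.choices(letters, weights=weights, k=len(template)))
        scores.append(sw_score(query, rand_seq))
    return scores


query = "MKVLAAGIVGLLLAQ"     # hypothetical stand-in for human PTCH2
template = "MKILSAGVVALLAAE"  # hypothetical stand-in for C. elegans TRA2
scores = random_scores(query, template)
print("score for the real pair:", sw_score(query, template))
print("highest score among random sequences:", max(scores))
```

The list `scores` is the empirical null distribution: the alignment scores expected when the query is aligned to sequences that resemble the template in length and composition but are otherwise unrelated.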
Compare the scores obtained with the score seen for PTCH2 & TRA2
[Figure: histogram of the alignment scores for alignments of random sequences (x-axis: alignment score; y-axis: number of alignments), with the alignment score for PTCH2 & TRA2 marked; the tail at or above the marked score holds 5% of the scores for alignments of random sequences]
What fraction of the random sequences have a score equal to or higher than that for TRA2 & PTCH2? eg. 0.05 in the picture
This method can be used to estimate the significance of alignments in the form of P-values, eg. P=0.05 in the picture
We accept the alignment as significant (indicating probable homology) if the score is in the top 5% (or another chosen value) of the scores for random sequences, ie. if P ≤ 0.05
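The P-value estimate described above is just a fraction, as the sketch below shows. The function name `empirical_p_value` is hypothetical and the list of random scores is made up for illustration.

```python
def empirical_p_value(observed: float, random_scores: list[float]) -> float:
    """Fraction of random-sequence scores equal to or higher than `observed`."""
    if not random_scores:
        raise ValueError("need at least one random score")
    hits = sum(1 for s in random_scores if s >= observed)
    return hits / len(random_scores)


# Made-up scores: 190 low-scoring and 10 high-scoring random alignments.
scores = [10] * 190 + [60] * 10
p = empirical_p_value(55, scores)  # 10 of 200 scores are >= 55
print(p, "significant" if p <= 0.05 else "not significant")
```

With N random sequences the estimate can only take values that are multiples of 1/N, so a larger N is needed to resolve very small P-values.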
eg. for human PTCH2 and C. elegans TRA2:
  The alignment score is 136
  When 200 random sequences (generated with a ‘TRA2’ multinomial model) were aligned to PTCH2, only 0.36% of the alignments had a score of ≥136
  Therefore, we estimate a P-value of P=0.0036
  ie. we estimate that the probability of getting a score of ≥136 for PTCH2 and TRA2 due to chance is 0.0036 (36/10,000)
Alignment of human PTCH2 & C. elegans TRA2 (score = 136):
Human PTCH2 and C. elegans tra-2 are probably homologues
In the example below, 0.95 of the random sequences have an alignment score equal to or higher than that for A & B, so P=0.95
[Figure: histogram of the alignment scores for alignments of random sequences (x-axis: alignment score; y-axis: number of alignments), with the alignment score for A & B marked; 95% of the scores for alignments of random sequences are equal to or higher than it]
Alignment of fruitfly Eyeless & C. elegans TRA2 (score = 78): P=1
eyeless and tra-2 are probably not homologues
Further Reading
• Chapter 3 in Introduction to Computational Genomics, Cristianini & Hahn
• Chapter 6 in Computational Genome Analysis, Deonier et al
• Practical on alignment in R in the Little Book of R for Bioinformatics: https://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html