Statistical Significance of Alignments

                    Dr Avril Coghlan
                   alc@sanger.ac.uk

Note: this talk contains animations which can only be seen by
downloading and using ‘View Slide show’ in Powerpoint
Biological importance of alignments
• A sequence alignment represents a hypothesis about
  the homology of individual positions in different
  sequences:


                       Hypothesis: Y in seqs 1,2 is homologous to E in seqs 3,4
• Based on an alignment, we quantify similarity
• Sequence similarity suggests a shared evolutionary
  history
  Furthermore, proteins with very similar sequences probably have
  similar biological functions
• Once we have an alignment between 2 sequences,
      we can calculate their similarity over their lengths
       A measure of similarity is percent identity, ie. number of identical
             amino acids * 100 / length of the alignment
       eg. the alignment below is 39 amino acids long, & the human & fruitfly
             sequences differ at 1 position
       → Human & fruitfly sequences have a percent identity of (38*100/39 =)
             97% in this part of the Eyeless PAX domain


[Figure: alignment of positions 1-39 of the Eyeless PAX domain in human, mouse,
cat, sea squirt and fruitfly. Human and fruitfly Eyeless proteins differ at one
position.]
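
The percent identity formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name percent_identity is hypothetical, and it assumes the two sequences are already aligned (same length, gaps written as '-'). The example inputs are the two 10-residue pairs used in the 'Similarity versus homology' slides.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity = identical positions * 100 / length of the alignment.

    Assumes both strings are rows of the same alignment, with '-' for gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # Count columns where both rows carry the same residue (gap-gap is not a match).
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return identical * 100 / len(aligned_a)


high = percent_identity("VIVALASVEG", "VIVAVASVEG")  # 9 of 10 positions match
low = percent_identity("VIVALASVEG", "TSYAVFGRTW")   # 1 of 10 positions match
print(high, low)
```

Dividing by the alignment length (rather than the length of either sequence) follows the definition given on the slide.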
Similarity versus homology
• Homologues are similar because they had a common
  ancestor eg. eyeless homologues
• After aligning two sequences, we can say they are
  99% similar, or 50% similar, etc.

  V I V A L A S V E G
  V I V A V A S V E G   90% similar

  Very similar sequences are probably homologues

• Any 2 random sequences are similar to some extent,
  so similarity doesn’t necessarily imply homology
  V I V A L A S V E G
  T S Y A V F G R T W   10% similar

  Sequences with very low similarity may be homologues
Similarity versus homology
• Two girls are either sisters or not
• Two sequences are either homologues or not

  V I V A L A S V E G   90% similar
  V I V A V A S V E G   Incorrect: “90% homologous”

  V I V A L A S V E G   10% similar
  T S Y A V F G R T W   Incorrect: “10% homologous”
A key question is:
• How does one interpret minimal similarity?
  Are the sequences actually related, or is the alignment by chance?



                     Q K G S Y Q E K G Y C
                     |     |             |
                     Q Q E S G P V R S T C
Statistical analysis of alignments
• We’ve calculated the score for the best alignment
  between 2 sequences A and B, but is it due to chance
  or biology?
• Sequences accumulate substitutions over millions of
  years, so it is sometimes hard to decide if 2
  sequences are homologous
• Unrelated sequences may be somewhat similar due
  to chance
In humans, mutations in the PTCH2 gene are a cause of brain tumours and
     skin cancers

     In the nematode Caenorhabditis elegans, the tra-2 gene functions in
     development to determine the sex of the embryo
     C. elegans adults can be male (make sperm) or hermaphrodite (make
     sperm & eggs)

Alignment of human PTCH2 & Caenorhabditis elegans TRA2 (score = 136):




Are human PTCH2 and C. elegans tra-2 homologues?
Statistical significance of the alignment
• To decide if two sequences are likely to be
  homologues (related), we calculate the statistical
  significance of the alignment score
• To do this, we first need a null model (background
  model), ie. a statistical model that will let us
  calculate what we expect
  There are many proteins in all the different species
  2 randomly chosen proteins are expected to be unrelated
  Our null model should therefore describe the alignment scores
  expected for pairs of unrelated sequences
• How can we know the alignment scores for pairs of
  unrelated protein sequences?
  We could generate random protein sequences, & calculate
  alignment scores for pairs of random protein sequences
  We can use a multinomial model to generate random protein sequences
  ie. make a roulette wheel with different fractions of the wheel labelled
        for each of the 20 amino acids
  Then spin the wheel n times to make a random protein sequence that
        is n amino acids long


                                       In this multinomial model,
                                       p(P)=0.14, p(A)=0.28,
                                       p(W)=0.14, p(H)=0.14, p(E)=0.28
                                       All the other amino acids have
                                       probabilities of zero here
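
The roulette wheel can be sketched in Python with the standard library's weighted sampling. This is only an illustration under stated assumptions: the names WHEEL and random_sequence are hypothetical, and the probabilities are the rounded values from the slide (they sum to 0.98, which is harmless here because random.choices uses relative weights).

```python
import random

# Rounded probabilities from the slide; all other amino acids have probability 0.
WHEEL = {"P": 0.14, "A": 0.28, "W": 0.14, "H": 0.14, "E": 0.28}


def random_sequence(probabilities, n, rng=None):
    """Spin the wheel n times: draw n residues independently from the model."""
    if rng is None:
        rng = random.Random()
    letters = list(probabilities)
    weights = [probabilities[aa] for aa in letters]
    # random.choices samples with replacement, proportionally to the weights.
    return "".join(rng.choices(letters, weights=weights, k=n))


# A fixed seed makes the sketch reproducible; any seed would do.
seq = random_sequence(WHEEL, 20, random.Random(0))
print(seq)
```

Each position is drawn independently of the others, which is what makes this a multinomial model.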
• A good multinomial model for random sequences
  should take the sequence composition into account
  eg. we could use a multinomial model to generate random sequences of
       the same composition as C. elegans TRA2

  ie. make a roulette wheel where the fraction of the circle labelled with
        each of the 20 amino acids is set equal to the % of that amino
        acid in the TRA2 sequence
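
Setting the wheel fractions from a real sequence amounts to counting residues. The sketch below assumes a hypothetical helper named composition and uses a short toy sequence, since the TRA2 sequence itself is not reproduced in these slides.

```python
from collections import Counter


def composition(sequence):
    """Return the fraction of each amino acid in the sequence."""
    if not sequence:
        raise ValueError("sequence is empty")
    counts = Counter(sequence)
    total = len(sequence)
    return {aa: n / total for aa, n in counts.items()}


# Toy 7-residue sequence (an assumption for illustration, not TRA2).
toy = "PAWHEAE"
freqs = composition(toy)
for aa, fraction in sorted(freqs.items()):
    print(aa, round(fraction, 2))
```

The resulting dictionary can be used directly as the weights of a multinomial model, so that random sequences have the same expected composition as the input.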
• One way to see if an alignment score is statistically
   significant is to compare it to the scores for
   alignments of random sequences
    We make a random sequence of the same length and amino acid
    composition as one of our original 2 sequences (eg. TRA2)
    ie. use our ‘TRA2’ multinomial model to make a sequence


Alignment of human PTCH2 & a random sequence generated using a multinomial
model (with the probabilities of amino acids set equal to their fractions in TRA2)
(score = 51):
• We can generate 200 random sequences using our
  TRA2-like multinomial model
  For each random sequence, we can calculate the best alignment score for
  the random sequence and human PTCH2
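
The procedure above can be sketched end to end in Python. Everything here is an assumption made for illustration: the function names are hypothetical, the two input sequences are short toy strings rather than PTCH2 and TRA2, and the scoring scheme (match +2, mismatch -1, linear gap -2) is a stand-in for the substitution matrix and gap penalties a real analysis would use. The alignment score is a basic Smith-Waterman local alignment score.

```python
import random
from collections import Counter


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (toy scoring, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def random_like(template, rng):
    """Random sequence with the template's length and expected composition."""
    counts = Counter(template)
    letters = list(counts)
    weights = [counts[aa] for aa in letters]
    return "".join(rng.choices(letters, weights=weights, k=len(template)))


def null_scores(target, template, n_random=200, seed=0):
    """Align n_random template-like random sequences to the target."""
    rng = random.Random(seed)
    return [
        smith_waterman_score(target, random_like(template, rng))
        for _ in range(n_random)
    ]


# Toy stand-ins for the two proteins (assumptions, not the real sequences).
target = "HEAGAWGHEE"
template = "PAWHEAE"
scores = null_scores(target, template, n_random=200)
print(min(scores), max(scores))
```

The list of scores is the empirical null distribution against which the real alignment score is compared.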
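Compare the scores obtained with the score seen for PTCH2 & TRA2, eg.

[Figure: histogram of alignment scores for the random sequences (x-axis:
alignment score; y-axis: number of alignments of random sequences). The
alignment score for proteins PTCH2 & TRA2 is marked, and a region containing
5% of the scores for alignments of random sequences is labelled.]
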

What fraction of the random sequences have a score equal to or higher than that
   for TRA2 & PTCH2? eg. 0.05 (5%) in the picture
This method can be used to estimate the significance of alignments in the
   form of P-values, eg. P=0.05 in the picture
We accept the alignment as significant (indicating probable homology) if the
   score is in the top 5% (or another chosen value) of the scores for random
   sequences, ie. if P ≤ 0.05
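
The P-value estimate described above is simply the fraction of random-sequence scores that reach the observed score. A minimal sketch follows; the function name empirical_p_value is hypothetical and the list of random scores is made up for illustration.

```python
def empirical_p_value(observed_score, random_scores):
    """Fraction of random-sequence scores equal to or higher than the observed score."""
    if not random_scores:
        raise ValueError("need at least one random score")
    hits = sum(1 for s in random_scores if s >= observed_score)
    return hits / len(random_scores)


# Made-up null distribution: 20 scores, 41 to 60 inclusive.
random_scores = list(range(41, 61))

p = empirical_p_value(60, random_scores)  # only the score 60 is >= 60
significant = p <= 0.05                   # threshold used on the slide
print(p, significant)
```

Note that the smallest non-zero P-value this estimate can give is 1 divided by the number of random sequences, so more random sequences are needed to resolve very small P-values.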
eg. for human PTCH2 and C. elegans TRA2:
  The alignment score is 136
  When 200 random sequences (generated with a ‘TRA2’ multinomial
        model) were aligned to PTCH2, only 0.36% of the alignments had a
        score of ≥136
  Therefore, we estimate a P-value of P=0.0036
  ie. we estimate that the probability of getting a score of at least 136
        for PTCH2 and TRA2 due to chance is 0.0036 (36/10,000)

Alignment of human PTCH2 & C. elegans TRA2 (score = 136):




   Human PTCH2 and C. elegans tra-2 are probably homologues
In the example below, 0.95 of the random sequences have an alignment
    score equal to or higher than that for A & B, so P=0.95
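[Figure: histogram of alignment scores for the random sequences (x-axis:
alignment score; y-axis: number of alignments of random sequences). The
alignment score for a different pair A & B is marked, and a region containing
95% of the scores for alignments of random sequences is labelled.]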
Alignment of fruitfly Eyeless & C. elegans TRA2 (score = 78):




P=1, so eyeless and tra-2 are probably not homologues
Further Reading
•   Chapter 3 in Introduction to Computational Genomics Cristianini & Hahn
•   Chapter 6 in Deonier et al book Computational Genome Analysis
•   Practical on alignment in R in the Little Book of R for Bioinformatics:
    https://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html

Editor's Notes

  1. Image credit (Williams sisters): http://media-2.web.britannica.com/eb-media/24/79824-004-7C20393C.jpg Image credit (Marilyn Monroe): http://cm1.theinsider.com/media/0/81/12/Marilyn-Monroe-11.0.0.0x0.432x594.jpeg
  2. Note: made image by aligning Uniprot PTCH2_HUMAN to C. elegans TRA2 protein (CE23546 from WormBase) using Ssearch (Smith-Waterman algorithm) and viewing the alignment using Jalview. This alignment contains the Patched domain. Image credit (human): http://www.ensembl.org/img/species/pic_Homo_sapiens.png Image credit (C. elegans): http://www.ensembl.org/img/species/pic_Caenorhabditis_elegans.png
  3. Note: made image by aligning Uniprot PTCH2_HUMAN to C. elegans TRA2 protein (CE23546 from WormBase) using Ssearch (Smith-Waterman algorithm) and viewing the alignment using Jalview. Image credit (human): http://www.ensembl.org/img/species/pic_Homo_sapiens.png Image credit (C. elegans): http://www.ensembl.org/img/species/pic_Caenorhabditis_elegans.png