
Extracting Information From DNA Sequences Using Models of Sequence Evolution - Gavin Huttley


Examination of biological sequences can reveal many aspects of living systems. For instance, analysis of DNA sequence variation can be used to assess the relationship between organisms, properties of DNA repair systems or the operation of Darwinian natural selection. Markov models of sequence evolution are central to these efforts. I will discuss Markov models of DNA sequence evolution and how they are used within a phylogenetic framework to dissect evolutionary dynamics.


  1. Extracting information from DNA sequences using models of sequence evolution
     Gavin Huttley, John Curtin School of Medical Research, Australian National University
     © 2013 Gavin Huttley
  3. Overview
     • Darwin meets Mendel meets …
     • Felsenstein applies Fisher
     • Markov processes of sequence evolution
     • Measuring model support
  4. Darwin & Mendel: genetic change during evolution
  5. Mendel's underpinning
     • Population of 1 self-pollinating plant
     • Dies at reproduction
     • Founder is A/G at one locus
     • What is the probability the second generation population is homozygous at this locus?
     Image: http://www.flickr.com/photos/22281745@N04/2149169348/
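  6. Founder: A/G
     Gen 1: A/A (0.25), A/G (0.5), G/G (0.25)
       $\mathrm{Prob}(\text{fixed gen 1}) = \frac{1}{4} + \frac{1}{4} = \frac{1}{2}$
       $\mathrm{Prob}(\text{not fixed gen 1}) = \frac{1}{2}$
     Gen 2 (offspring of A/G): A/A (0.25), A/G (0.5), G/G (0.25)
       $\mathrm{Prob}(\text{fixed gen 2}) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4}$
     The probability of being fixed by generation 2 is then
       $\mathrm{Prob}(\text{fixed by gen 2}) = \frac{1}{2} + \frac{1}{4} = \frac{3}{4}$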
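The arithmetic on this slide extends to any number of generations. The sketch below assumes the set-up from the previous slide (a population of one selfing plant, Mendelian offspring ratios of 1/4, 1/2, 1/4, and a homozygote staying fixed forever); the function name is illustrative only.

```python
from fractions import Fraction


def prob_fixed_by(generation: int) -> Fraction:
    """Probability that a selfing lineage founded by one A/G plant
    is homozygous (fixed) by the given generation."""
    # The lineage is still A/G only if every generation so far drew
    # the heterozygote, which happens with probability 1/2 each time.
    still_heterozygous = Fraction(1, 2) ** generation
    return 1 - still_heterozygous


for g in (1, 2, 3):
    print(g, prob_fixed_by(g))
# prints:
# 1 1/2
# 2 3/4
# 3 7/8
```

Exact fractions are used so the output matches the slide's 1/2 and 3/4 without rounding.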
  7. [Figure: frequency of allele A (y-axis) against time (x-axis). Each line is a separate population.]
  9. “[$N_e$ is] the number of individuals in a theoretically ideal population having the same magnitude of random genetic drift as the actual population.” (Hartl & Clark, 2007, p 121)
  10. Random genetic drift (RGD) summary
     • The probability of fixation is just the allele frequency, which for a new variant is initially $1/(2N_e)$
     • The expected time to fixation is $4N_e$ generations
     • Big populations have more variation than small ones
     • Future allele frequencies depend only on the current population frequency, not past frequencies
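The first point of the summary can be checked by simulation. This is a minimal sketch assuming a Wright-Fisher population (binomial resampling of a fixed number of gene copies each generation, no mutation or selection); the population size, starting count and replicate number are arbitrary illustrative choices, not values from the talk.

```python
import random


def fixes(copies_of_a: int, total_copies: int, rng: random.Random) -> bool:
    """Run one Wright-Fisher population until allele A is lost or fixed."""
    count = copies_of_a
    while 0 < count < total_copies:
        freq = count / total_copies
        # Each gene copy in the next generation is A with probability freq,
        # so the future depends only on the current frequency.
        count = sum(rng.random() < freq for _ in range(total_copies))
    return count == total_copies


def estimate_fixation_probability(copies_of_a, total_copies, replicates, seed=1):
    """Fraction of independent populations in which A reaches fixation."""
    rng = random.Random(seed)
    fixed = sum(fixes(copies_of_a, total_copies, rng) for _ in range(replicates))
    return fixed / replicates


estimate = estimate_fixation_probability(copies_of_a=6, total_copies=20, replicates=2000)
print(f"starting frequency 0.30, estimated fixation probability {estimate:.3f}")
```

With a starting frequency of 6/20 the estimate should fall close to 0.3, up to sampling noise.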
  11. Mutation
  12. With mutation rate $\mu$ in a population of $N$ diploid individuals:
       Number of new mutations $= 2N\mu$
       $\mathrm{Prob}(\text{fixation of a new mutant}) = \frac{1}{2N}$
       Rate of fixation $= 2N\mu \times \frac{1}{2N} = \mu$
     So population size has no effect!!
  13. The Neutral Theory
  14. What does neutral mean?
     • That a genetic variant is ‘invisible’ to natural selection. (Hence, selectively neutral.)
     • The evolutionary dynamics (changes in frequency) are dictated by random genetic drift and mutation only.
     • “functionally less important molecules or parts of a molecule evolve faster than more important ones”
  15. • Say only hydrophilic amino acids are allowed at a specific position
     • In this case, mutation events that produce non-hydrophilic amino acids will be eliminated by natural selection, so their fixation probability is lower and so is the substitution rate
     [Figure: amino acid one-letter codes Q, C, P, D, S, K, T, U, R, G, E, H, N]
  16. Kimura’s rule of thumb: natural selection is only effective against RGD when $4N_e s \gg 1$
  17. What happened in populations is what we see between species, i.e. polymorphism and substitution are related.
  18. [Figure: allele frequency (0 to 1) against time (0 to 11), with trajectories labelled Positive Natural Selection, Balancing Natural Selection, Neutral Evolution and Negative Natural Selection.]
  19. Summary
     • The Neutral Theory as the null hypothesis for evolutionary analysis.
     • Darwin's filter as a basis for predicting biological significance
       • Slow evolution is taken as evidence of functional importance
     • In other words, inferring selection requires quantifying a neutral process
  20. Felsenstein applies Fisher: likelihood for phylogenetics
  21. “The likelihood, L(H|R), of the hypothesis H given the data R, and a specific model, is proportional to P(R|H), the constant of proportionality being arbitrary.” (Edwards 1992)
  22. Utility of phylogenetic techniques
     • identify relationships among sequences
     • understand divergence mechanisms
  23. The phylogenetic “hypothesis”
     • The tree topology
     • Representation of sequence divergence
     • Substitution matrices (P(n) in the figure) for each branch, specifying probabilities of change from one sequence state to another
     • Ancestral state frequencies
  24. Likelihood for 3 sequences
     [Figure: three-sequence tree with unobserved ancestral states and branch matrices $P(t_1)$, $P(t_2)$, $P(t_3)$]
     For this alignment column, the likelihood given that the ancestral base was A is
       $L(A) = \pi_A \, p_{A,A}(t_1) \, p_{A,G}(t_2) \, p_{A,C}(t_3)$
     The full likelihood is
       $L_1 = L(A) + L(G) + L(C) + L(T)$
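The sum over unobserved ancestral states can be sketched for a single alignment column. The slide does not fix a substitution model, so this example swaps in Jukes-Cantor (JC69) purely because its transition probabilities have a closed form; the branch lengths are hypothetical, and the observed column (A, G, C) is read off the slide's formula.

```python
import math

BASES = "ACGT"


def jc69_prob(start: str, end: str, t: float) -> float:
    """JC69 probability of observing `end` after time t given `start`
    (assumption: JC69 stands in for the slide's unspecified P(t))."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if start == end else 0.25 - 0.25 * decay


def column_likelihood(tips: str, lengths) -> float:
    """Likelihood of one column on a three-tip tree with one ancestor."""
    total = 0.0
    for ancestor in BASES:
        term = 0.25  # pi for the ancestral base (uniform under JC69)
        for tip, t in zip(tips, lengths):
            term *= jc69_prob(ancestor, tip, t)
        total += term  # L_1 = L(A) + L(C) + L(G) + L(T)
    return total


lik = column_likelihood(tips="AGC", lengths=(0.1, 0.2, 0.3))
print(f"likelihood of column AGC: {lik:.6f}")
```

Summing this quantity over all 64 possible columns gives 1, which is a useful sanity check on any implementation.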
  25. Some Assumptions
     • edges are independent
     • the same tree holds for all nucleotides
     • plus the assumptions of the substitution model
  26. Likelihood & consistency
     • Consistency (convergence of an estimate to the true parameter value) occurs by addition of aligned columns, i.e. longer alignments
     • JT Chang. Full reconstruction of Markov models on evolutionary trees: Identifiability and consistency. Mathematical Biosciences 137:1, 51-73
  27. Markov processes
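  28. Q for F81/HKY85/GTR
     \[ P(t_i) = \exp(Q_i t_i) \]
     \[ Q = \begin{bmatrix}
       - & r_{A\leftrightarrow C}\,\pi_C & r_{A\leftrightarrow G}\,\pi_G & r_{A\leftrightarrow T}\,\pi_T \\
       r_{A\leftrightarrow C}\,\pi_A & - & r_{C\leftrightarrow G}\,\pi_G & r_{C\leftrightarrow T}\,\pi_T \\
       r_{A\leftrightarrow G}\,\pi_A & r_{C\leftrightarrow G}\,\pi_C & - & r_{G\leftrightarrow T}\,\pi_T \\
       r_{A\leftrightarrow T}\,\pi_A & r_{C\leftrightarrow T}\,\pi_C & r_{G\leftrightarrow T}\,\pi_G & -
     \end{bmatrix} \]
     $r_{i\leftrightarrow j}$: exchangeability term; $\pi_i$: probability of base $i$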
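A rate matrix of this form, and the matrix exponential that turns it into P(t), can be sketched with the standard library alone. The base frequencies and exchangeabilities below are hypothetical, the diagonal is set so each row sums to zero (the usual convention for rate matrices), and the exponential uses a plain scaled Taylor series rather than any particular library's method. No rate normalisation is applied.

```python
BASES = "ACGT"


def gtr_rate_matrix(pi, exch):
    """Q[i][j] = r(i<->j) * pi[j] off the diagonal; rows sum to zero."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(BASES):
        for j, b in enumerate(BASES):
            if i != j:
                q[i][j] = exch[frozenset((a, b))] * pi[b]
        q[i][i] = -sum(q[i])
    return q


def matmul(x, y):
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm(q, t, squarings=10, terms=20):
    """exp(Q t) by Taylor series on Q t / 2**squarings, then repeated squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[v * scale for v in row] for row in q]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in matmul(term, a)]  # A^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result


# Hypothetical parameters: HKY85-like, transitions (A<->G, C<->T) four times faster.
pi = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
exch = {frozenset(p): 1.0 for p in ("AC", "AT", "CG", "GT")}
exch[frozenset("AG")] = exch[frozenset("CT")] = 4.0

Q = gtr_rate_matrix(pi, exch)
P = expm(Q, t=0.5)
for base, row in zip(BASES, P):
    print(base, " ".join(f"{p:.4f}" for p in row))
```

Setting all exchangeabilities equal gives the F81 form, two classes (transitions and transversions) give HKY85, and six free values give GTR.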
  29. Typical assumptions
     • positions evolve iid
     • time homogeneous (embeddable)
     • reversible
     • stationary
  30. Comparing models
  31. Comparing nested models
     • Tree topology must be the same between null and alternate models
     • Processes in null and alternate models must be nested, e.g. HKY and GTR
     • THEN (typically), the conventional likelihood ratio test can be employed and related to the $\chi^2$ distribution
       $LR = 2(\ln L_1 - \ln L_0)$
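The likelihood ratio test can be sketched as follows. The log-likelihoods are hypothetical, and the $\chi^2$ tail probability is only implemented for 1 and 2 degrees of freedom, where it has a closed form; F81 versus HKY85 differs by one parameter, so 1 degree of freedom applies to that comparison.

```python
import math


def likelihood_ratio(lnL_null: float, lnL_alt: float) -> float:
    """LR = 2 (ln L1 - ln L0)."""
    return 2.0 * (lnL_alt - lnL_null)


def chi2_sf(x: float, df: int) -> float:
    """P(X >= x) for a chi-square variable, closed forms for df 1 and 2 only."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only covers df = 1 or 2")


# Hypothetical fitted log-likelihoods for F81 (null) and HKY85 (alternate).
lr = likelihood_ratio(lnL_null=-1234.5, lnL_alt=-1230.0)
p_value = chi2_sf(lr, df=1)
print(f"LR = {lr:.2f}, p = {p_value:.4f}")
```

For other degrees of freedom a statistics library's chi-square survival function would be the practical choice.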
  32. F81 vs HKY85
  33. The $\chi^2$ approximation
  36. Comparing non-nested models
     • Also use a LR statistic
     • The probability of observing a LR statistic of equal or greater value by chance under the null hypothesis is estimated using a parametric bootstrap procedure in which data are simulated under the fitted null model
     • Goldman N. (1993) J. Mol. Evol. 36:2, 182-98
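The bootstrap procedure reduces to a simple loop once simulation and model fitting are available. In this sketch both are passed in as callables with hypothetical names, because the slide does not say which software performs them, and the add-one correction in the p-value is a common convention rather than something the slide specifies.

```python
def bootstrap_p_value(observed_lr, simulated_lrs):
    """Proportion of simulated LR statistics at least as large as the observed one."""
    exceed = sum(1 for lr in simulated_lrs if lr >= observed_lr)
    # Add-one correction (an assumption) keeps the estimate away from zero.
    return (exceed + 1) / (len(simulated_lrs) + 1)


def parametric_bootstrap(observed_lr, simulate_under_null, lr_statistic, replicates):
    """simulate_under_null() returns one data set drawn from the fitted null model;
    lr_statistic(data) refits both models to it and returns the LR.
    Both callables are hypothetical placeholders supplied by the caller."""
    simulated = [lr_statistic(simulate_under_null()) for _ in range(replicates)]
    return bootstrap_p_value(observed_lr, simulated)


# Stand-in callables so the sketch runs: the "data" are just precomputed LR values.
fake_lrs = iter([1.0, 5.0, 2.0, 0.5])
p = parametric_bootstrap(
    observed_lr=4.0,
    simulate_under_null=lambda: next(fake_lrs),
    lr_statistic=lambda data: data,
    replicates=4,
)
print(p)
```

In real use each replicate involves simulating an alignment and fitting both models, which is why this approach is far more expensive than the $\chi^2$ approximation.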
  37. Other comparison approaches
     • information criteria (AIC, BIC)
     • estimates of a parameter of interest
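Both criteria are one-line formulas on the maximised log-likelihood. The numbers below are hypothetical, and taking the sample size for BIC to be the number of alignment columns is an assumption, not something the slide states.

```python
import math


def aic(lnL: float, num_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * num_params - 2 * lnL


def bic(lnL: float, num_params: int, sample_size: int) -> float:
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is better)."""
    return num_params * math.log(sample_size) - 2 * lnL


# Hypothetical fits: (model name, log-likelihood, number of free parameters).
fits = [("F81", -1234.5, 3), ("HKY85", -1230.0, 4)]
for name, lnL, k in fits:
    print(name, f"AIC={aic(lnL, k):.1f}", f"BIC={bic(lnL, k, sample_size=1000):.1f}")
```

Unlike the likelihood ratio test, neither criterion requires the models to be nested.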
  38. Model choice considerations
  39. • Consistency is meaningful for the true (generating) model
     • Model specification is THE most important issue
     • Black-box model comparison procedures can support choices that are mechanistically invalid
     • Models should be mechanistically coherent, interpretable and explain the data well
  40. I will show
     • Context dependent models are warranted
     • For context dependent models, model specification choices have profound consequences
     • Not just about number of parameters
     • One approach to an empirical check
  41. DNA encodes information in a context dependent manner
  42. Encoding proteins with DNA
     • 20 aa are encoded by triplets of nucleotides (codons)
     • There are three special “stop” codons
     • changes to codons are classified as
       • synonymous (syn): changes do not modify the encoded aa
       • nonsynonymous (nsyn): changes do
       • nonsense: changes create a stop codon
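The three classes can be made concrete with a lookup table. This sketch assumes the standard genetic code (the slide does not name a code table) written in the conventional TCAG ordering; the function name is illustrative.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify(codon_from: str, codon_to: str) -> str:
    """Classify a change between two codons by its effect on the encoded aa."""
    if CODE[codon_to] == "*":
        return "nonsense"
    if CODE[codon_from] == CODE[codon_to]:
        return "synonymous"
    return "nonsynonymous"


print(classify("CTT", "CTC"))  # Leu -> Leu
print(classify("AAA", "ATA"))  # Lys -> Ile
print(classify("TGG", "TGA"))  # Trp -> stop
```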
  43. Modelling codon evolution
     • Split alignments into nonoverlapping trinucleotides and treat each such column as evolving independently
  44. Readily tested in protein coding sequences
     • Nonsynonymous substitutions can be affected by natural selection
     • Synonymous substitutions do not modify the encoded amino acid and are presumed “neutral”
     • The rate ratio (nsyn/syn), termed ω, is taken as an indicator of the mode of natural selection
  45. pic from wikipedia
     [Screenshot of paper] Phylogenetic Evidence for Frequent Positive Selection and Recombination in the Meningococcal Surface Antigen PorB. Rachel Urwin, Edward C. Holmes, Andrew J. Fox, Jeremy P. Derrick and Martin C. J. Maiden (The Peter Medawar Building for Pathogen Research and Department of Zoology, University of Oxford; Meningococcus Reference Unit, Public Health Laboratory, Withington Hospital, Manchester; Department of Biomolecular Sciences, University of Manchester Institute of Science and Technology)
     “Previous estimates of rates of synonymous (dS) and nonsynonymous (dN) substitution among Neisseria meningitidis gene sequences suggested that the surface loops of the variable outer membrane protein PorB were under only weak selection pressure from the host immune response. These findings were consistent with studies indicating that PorB variants were not always protective in immunological and microbiological assays and questioned the suitability of this protein as a vaccine component. PorB, which is expressed at high levels on the surface of the meningococcus, …”
     and thousands more
  46. Modelling contextual influences
  47. Context dependent rate matrices
     • Multi-position changes disallowed
     • Definition of $\pi_e$ distinguishes competing model forms
       • MG: frequency of ending base
       • GY: frequency of ending tuple
       • CNF: conditional frequency of ending base
     \[ q(a,b) = \begin{cases} 0 & \text{more than one difference} \\ r(a,b)\,\pi_e & \text{otherwise} \end{cases} \]
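  48. Two state alphabet {R,Y}, dinucleotide states ordered RR, RY, YR, YY
     MG form (NF):
     \[ Q_{MG}(\kappa) = \begin{bmatrix}
       - & \pi(Y) & \pi(Y) & 0 \\
       \pi(R) & - & 0 & \kappa\pi(Y) \\
       \pi(R) & 0 & - & \kappa\pi(Y) \\
       0 & \kappa\pi(R) & \kappa\pi(R) & -
     \end{bmatrix} \]
     GY form (TF):
     \[ Q_{GY}(\mathrm{K}) = \begin{bmatrix}
       - & \pi(R)\pi(Y) & \pi(Y)\pi(R) & 0 \\
       \pi(R)\pi(R) & - & 0 & \mathrm{K}\pi(Y)\pi(Y) \\
       \pi(R)\pi(R) & 0 & - & \mathrm{K}\pi(Y)\pi(Y) \\
       0 & \mathrm{K}\pi(R)\pi(Y) & \mathrm{K}\pi(Y)\pi(R) & -
     \end{bmatrix} \]
     \[ \mathrm{K} = \frac{\kappa\,\pi(R)}{\pi(Y)} \]
     NB: GY here has multiplicative form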
  49. Simulated F81 AT-rich seqs
     [Figure: results for MG and GY]
     Lindsay, H., Yap, V. B., Ying, H., & Huttley, G. A. (2008). Pitfalls of the most commonly used models of context dependent substitution. Biol Direct, 3, 52.
  50. • MG
       • π is multiplicative, meaning it’s the product of the monomer frequencies*
       • to get to an independent (monomer) process you remove context parameters
     • GY
       • π is not multiplicative
       • π is a more realistic representation of tuple frequencies in real data
       • to get to an independent (monomer) process you add context parameters
  51. The Conditional Nucleotide Frequency (CNF) model
     \[ q(a,b) = \begin{cases} 0 & \text{more than one difference} \\ r(a,b)\,\pi_e & \text{otherwise} \end{cases} \]
     Consider the exchange AAA → ATA
     CNF: $\pi_e$ is the conditional probability of T given 5’-A•A-3’.
     Yap, V. B., Lindsay, H., Easteal, S., & Huttley, G. (2010). Estimates of the effect of natural selection on protein coding content. Molecular Biology and Evolution, 27(3), 726-34.
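The three definitions of $\pi_e$ can be compared on the slide's AAA → ATA example. This sketch estimates frequencies from a short hypothetical sequence split into nonoverlapping trinucleotides. For MG it pools base frequencies across the three tuple positions, which is a simplifying assumption (position-specific frequencies are another common choice); the function names are illustrative.

```python
from collections import Counter


def tuple_freqs(seq: str, k: int = 3) -> dict:
    """Frequencies of nonoverlapping k-tuples in seq."""
    tuples = [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]
    counts = Counter(tuples)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def pi_e(model: str, start: str, end: str, freqs: dict) -> float:
    """The pi_e term for a single-position change start -> end."""
    diffs = [i for i, (a, b) in enumerate(zip(start, end)) if a != b]
    if len(diffs) != 1:
        raise ValueError("exactly one position must differ")
    pos = diffs[0]
    if model == "GY":  # frequency of the ending tuple
        return freqs.get(end, 0.0)
    if model == "MG":  # frequency of the ending base, pooled over positions
        return sum(f * t.count(end[pos]) for t, f in freqs.items()) / len(end)
    if model == "CNF":  # ending base conditional on its flanking bases
        context = sum(f for t, f in freqs.items()
                      if t[:pos] == end[:pos] and t[pos + 1:] == end[pos + 1:])
        return freqs.get(end, 0.0) / context if context else 0.0
    raise ValueError(f"unknown model {model!r}")


# Hypothetical data: trinucleotides AAA, ATA, ACA, ATA, AGA, TTT.
freqs = tuple_freqs("AAAATAACAATAAGATTT")
for model in ("MG", "GY", "CNF"):
    print(model, round(pi_e(model, "AAA", "ATA", freqs), 4))
```

For these data the three models weight the same exchange differently (5/18, 1/3 and 2/5 respectively), which is the sense in which the definition of $\pi_e$ distinguishes them.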
  52. Codon models, HKY form
     \[ q_{a,b} = \begin{cases}
       0 & \text{more than 1 difference} \\
       \pi_x & \text{synonymous transversion} \\
       \pi_x \cdot \omega & \text{nonsynonymous transversion} \\
       \pi_x \cdot \kappa & \text{synonymous transition} \\
       \pi_x \cdot \kappa \cdot \omega & \text{nonsynonymous transition}
     \end{cases} \]
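The case analysis above maps directly onto a small function. This sketch assumes the standard genetic code, reads $\pi_x$ as the frequency of the ending codon (the GY form), and gives zero rate to any change involving a stop codon; the uniform codon frequencies and the values of κ and ω are hypothetical.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
TRANSITIONS = ({"A", "G"}, {"C", "T"})


def codon_rate(a: str, b: str, pi: dict, kappa: float, omega: float) -> float:
    """Instantaneous rate from codon a to codon b, HKY-style codon model."""
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    if len(diffs) != 1:
        return 0.0  # identical codons or more than 1 difference
    if "*" in (CODE[a], CODE[b]):
        return 0.0  # assumption: stop codons are outside the state space
    x, y = diffs[0]
    rate = pi[b]  # assumption: pi_x is the ending codon's frequency
    if {x, y} in TRANSITIONS:
        rate *= kappa
    if CODE[a] != CODE[b]:
        rate *= omega
    return rate


sense = [c for c, aa in CODE.items() if aa != "*"]
pi = {c: 1 / len(sense) for c in sense}  # hypothetical uniform frequencies
print(codon_rate("CTT", "CTC", pi, kappa=2.0, omega=0.5))  # synonymous transition
print(codon_rate("AAA", "ATA", pi, kappa=2.0, omega=0.5))  # nonsynonymous transversion
```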
  53. [Figure: Multiplicative versus non-Multiplicative forms of GY and MG, for AT-rich, AT≈GC and GC-rich sequences]
  54. Limits of Simulation
     http://xkcd.com/221/
  55. [Figure: panels GYtri,HKY; MGtri,GTR; CNFtri,GTR]
  56. BUT does a model explain the data well?
  57. How good are the models?
     • Comparing models by LR tests or AIC
       • rubbish against rubbish?
     • Do they explain the data well?!
     • How can we evaluate this?
       • G-statistic (expecteds computed from MLEs)
       • Comparison with Goldman’s best likelihood
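The G-statistic compares observed counts with the counts expected under the fitted model. The sketch below uses the usual definition $G = 2\sum O \ln(O/E)$; the counts are hypothetical, standing in for counts of alignment column patterns and their expectations computed from maximum likelihood estimates.

```python
import math


def g_statistic(observed, expected) -> float:
    """G = 2 * sum(O * ln(O / E)); categories with O = 0 contribute nothing."""
    return 2.0 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )


# Hypothetical counts of four column patterns and their model expectations.
observed = [40, 30, 20, 10]
expected = [35.0, 30.0, 25.0, 10.0]
print(f"G = {g_statistic(observed, expected):.3f}")
```

A value near zero means the model reproduces the observed pattern counts closely; large values indicate a poor absolute fit, whichever model won the relative comparison.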
  58. Summary
     • Multitude of evolutionary models published
     • Underlying rate matrices share assumptions that are not well supported
     • Model comparison approaches amongst these do not address the fundamental issue of how well the data are described by the model
     • Metrics of tree support are meaningful only if the model explains the data well
     • More on this by Dr Ben Kaehler on Friday
  59. Acknowledgements: Hardip, Bob, Åsa, Yicheng, Jackie, Ben, Aaron, Cam, Steph, Gavin; VB Yap (Nat. Uni. Singapore), H Lindsay (ETS Zurich), H Ying (CSIRO), Peter Maxwell (NZ)
     Funding from ARC, NHMRC, BPA
