This document describes a Beta approximation to the Wright-Fisher model of genetic drift in population genetics. A moment-based approach computes the mean and variance of the allele frequency over time, and a Beta distribution matching those moments approximates the allele frequency distribution. "Spikes" at the boundaries 0 and 1 are then added to the Beta distribution to better model the loss and fixation probabilities.
1. Inference under the Wright-Fisher model using an accurate Beta approximation
Paula Tataru, Thomas Bataillon, Asger Hobolth
Aarhus University, Bioinformatics Research Centre
CSHL, April 15th 2015
2. An accurate Beta approximation
Paula Tataru paula@birc.au.dk
AARHUS
UNIVERSITY
Bioinformatics
Research Centre
Theoretical population genetics
›Mathematical models formalize the evolution of
genetic variation within and between populations
›Provide a framework for inferring evolutionary paths
from observed data to
Inference problems
›Inference of population history from DNA data
› (Variable) population size
› Migration / admixture
› Divergence times
› Selection coefficients
Inference problems: population size
PSMC: H. Li and R. Durbin. Inference of human population history from individual whole-genome sequences. Nature, 475:493–496, 2011.
Inference problems: populations divergence
Kim Tree: M. Gautier and R. Vitalis. Inferring population histories using genome-wide allele frequency data. Molecular Biology and Evolution, 30(3):654–668, 2013.
Inference problems: populations admixture
TreeMix: J. K. Pickrell and J. K. Pritchard. Inference of population splits and mixtures from genome-wide allele frequency data. PLOS Genetics, 8(11):e1002967, 2012.
Inference problems: populations admixture
G-PhoCS: I. Gronau, M. J. Hubisz, B. Gulko, C. G. Danko and A. Siepel. Bayesian inference of ancient human demography from individual genome sequences. Nature Genetics, 43(10):1031–1034, 2011.
Inference problems: loci under selection
spectralHMM: M. Steinrücken, A. Bhaskar and Y. S. Song. A novel spectral method for inferring general selection from time series genetic data. The Annals of Applied Statistics, 8(4):2203–2222, 2014.
Population genetics: the Wright-Fisher model
› Evolution of a population forward in time
› Follow one locus (region in the DNA)
› Different variants at the locus are called alleles
[Figure: individuals across generations (time)]
Population genetics: the Wright-Fisher model
› Basic model: only two alleles per locus
› Follow the frequency of one of the alleles
[Figure: individuals across generations (time), with the allele count per generation: 3, 2, 3, 3, 4, 5, 5]
Allele frequency distribution
Population genetics: the coalescent model
› Trace the genealogy of sampled individuals backward in time
› Coalescent process terminates when reaching the MRCA (most recent common ancestor)
[Figure: genealogy of sampled individuals across generations (time), ending at the MRCA]
Two dual models
›The Wright-Fisher
› Forward in time
› Follow allele frequency
› Selection
› Scalability: sample size decreases uncertainty
›The coalescent
› Backward in time
› Follow genealogy
› Recombination
› Scalability: sample size increases complexity
Approximations to the Wright-Fisher
›Diffusion
› Large population size
› Infinitesimal change
› No closed solution
› Cumbersome to evaluate
›Moment-based
› Convenient distributions: Normal distribution, Beta distribution
› Closed analytical forms
› Fast to evaluate
› Problematic at boundaries
Behavior at the boundaries
›Normal distribution
› Support: real line
› Truncation: incorrect variance
› Intermediary frequencies
›Beta distribution
› Support: [0, 1]
› Intermediary frequencies
The Beta with spikes
›Use of Wright-Fisher
› Scalable
›Use of moments
› Simple mathematical calculations
›Improve behavior at boundaries
› Preserve mean and variance
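The Wright-Fisher model
› $Z_t$: allele count in generation $t$
› $X_t = Z_t / 2N$: allele frequency
› $Z_{t+1}$ follows a binomial distribution: $Z_{t+1} \mid Z_t \sim \mathrm{Bin}(2N, g(X_t))$
› $g$ encodes the evolutionary pressures
[Figure: individuals across generations (time), with the allele count per generation: 3, 2, 3, 3, 4, 5, 5]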
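The binomial sampling step can be illustrated with a small simulation. This is a minimal sketch, not code from the talk: the function name `wright_fisher_path` and its parameters are hypothetical, and it assumes that the next allele count is drawn from a binomial distribution with 2N trials and success probability g(X_t), with g(x) = x when only drift acts.

```python
import random


def wright_fisher_path(z0, two_n, generations, g=lambda x: x, seed=None):
    """Simulate allele counts Z_0, ..., Z_T.

    Assumes Z_{t+1} | Z_t ~ Bin(2N, g(Z_t / 2N)); the default g(x) = x
    corresponds to pure drift.
    """
    rng = random.Random(seed)
    counts = [z0]
    for _ in range(generations):
        p = g(counts[-1] / two_n)
        # Binomial draw as a sum of 2N Bernoulli trials (fine for small 2N).
        counts.append(sum(rng.random() < p for _ in range(two_n)))
    return counts


if __name__ == "__main__":
    # Seven generations starting from 3 copies among 2N = 10 gene copies.
    print(wright_fisher_path(3, 10, 6, seed=1))
```

Passing a different `g` changes the evolutionary pressure without touching the sampling step, which is the separation the slide describes.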
The Wright-Fisher model: Drift only
[Figure: individuals across generations (time), with the allele count per generation: 3, 2, 3, 3, 4, 5, 5]
The Wright-Fisher model: Mutations
[Figure: individuals across generations (time), with the allele count per generation: 3, 2, 4, 5, 4, 3, 2; labels: u, v]
The Wright-Fisher model: Migration
[Figure: individuals across generations (time), with the allele count per generation: 3, 2, 3, 5, 4, 2, 3; labels: m1, m2]
The Wright-Fisher model: Linear forces
›Mutations
›Migration
›Mutations & Migration
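The formulas for these forces did not survive in this transcript, so the following is only a sketch of what "linear" means here, under assumptions that are mine: `u` and `v` are taken to be the mutation rates away from and towards the followed allele, and migration is modelled with a single hypothetical rate `m` from a source population with allele frequency `x_source` (the slides label two quantities m1 and m2 whose definitions are not recoverable). Each force is a linear function of the allele frequency x, and applying one after the other is again linear.

```python
def mutation(u, v):
    """g(x) = (1 - u) * x + v * (1 - x); u and v are assumed mutation rates."""
    return lambda x: (1 - u) * x + v * (1 - x)


def migration(m, x_source):
    """g(x) = (1 - m) * x + m * x_source; m and x_source are hypothetical."""
    return lambda x: (1 - m) * x + m * x_source


def compose(first, second):
    """Apply `first`, then `second`; two linear forces compose to a linear force."""
    return lambda x: second(first(x))


if __name__ == "__main__":
    g = compose(mutation(0.01, 0.02), migration(0.1, 0.5))
    # For a linear g, the value at the midpoint is the average of the endpoints.
    print(g(0.0), g(0.5), g(1.0))
```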
The Beta approximation: Main idea
›The density of Xt
›Use recursive approach to calculate
› Mean and variance
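The recursion itself is not preserved in this transcript. As a sketch under stated assumptions, suppose the linear force is written as g(x) = (1 - a - b) x + b (the symbols `a` and `b` are mine). The law of total expectation and the law of total variance applied to the binomial sampling step then give a one-generation update for the mean and variance, and a Beta distribution can be matched to those two moments. The function names are hypothetical.

```python
def next_moments(mean, var, two_n, a=0.0, b=0.0):
    """Advance the mean and variance of X_t by one generation.

    Assumes g(x) = (1 - a - b) * x + b followed by binomial sampling
    of 2N gene copies; a = b = 0 is pure drift.
    """
    new_mean = (1 - a - b) * mean + b
    new_var = ((1 - 1 / two_n) * (1 - a - b) ** 2 * var
               + new_mean * (1 - new_mean) / two_n)
    return new_mean, new_var


def beta_parameters(mean, var):
    """Shape parameters of the Beta distribution with the given mean and variance."""
    scale = mean * (1 - mean) / var - 1
    return mean * scale, (1 - mean) * scale


if __name__ == "__main__":
    mean, var = 0.3, 0.0  # X_0 = 0.3 is known exactly
    for _ in range(50):
        mean, var = next_moments(mean, var, two_n=200)
    print(mean, var, beta_parameters(mean, var))
```

Under pure drift the recursion reduces to a constant mean and a variance of x0 (1 - x0) (1 - (1 - 1/2N)^t), which gives a quick sanity check.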
The Beta approximation: Drift only
The Beta with spikes: Main idea
›The density of Xt
›Use recursive approach to calculate
› Loss and fixation probabilities
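The equations on the following slides are not preserved in this transcript, so this is one possible reading of the idea rather than the authors' exact formulation. The sketch assumes pure drift, a known starting frequency `x0`, and that the frequency in generation t is a mixture of a point mass at 0 (loss), a point mass at 1 (fixation), and a Beta distribution on the interior whose parameters are chosen so that the overall mean and variance are preserved. The loss probability then grows by the interior mass times E[(1 - X)^{2N}] under that Beta, and the fixation probability by the interior mass times E[X^{2N}]. All names are hypothetical.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def spikes_drift_only(x0, two_n, generations):
    """Return [(t, p_loss, p_fix)] for t = 1..generations under pure drift."""
    mean = x0  # drift alone leaves the mean unchanged
    var = x0 * (1 - x0) / two_n
    p_loss = (1 - x0) ** two_n  # exact after one binomial draw
    p_fix = x0 ** two_n
    out = [(1, p_loss, p_fix)]
    for t in range(2, generations + 1):
        inner = 1 - p_loss - p_fix  # mass strictly between 0 and 1
        # Mean and variance conditional on 0 < X < 1, preserving the overall moments.
        m_star = (mean - p_fix) / inner
        v_star = (var + mean ** 2 - p_fix) / inner - m_star ** 2
        scale = m_star * (1 - m_star) / v_star - 1
        a, b = m_star * scale, (1 - m_star) * scale
        # E[(1 - X)^{2N}] and E[X^{2N}] for X ~ Beta(a, b).
        e_loss = math.exp(log_beta(a, b + two_n) - log_beta(a, b))
        e_fix = math.exp(log_beta(a + two_n, b) - log_beta(a, b))
        p_loss += inner * e_loss
        p_fix += inner * e_fix
        var = (1 - 1 / two_n) * var + mean * (1 - mean) / two_n
        out.append((t, p_loss, p_fix))
    return out


if __name__ == "__main__":
    for t, p_loss, p_fix in spikes_drift_only(0.1, 100, 20)[-3:]:
        print(t, p_loss, p_fix)
```

The sketch does not guard against the interior variance becoming non-positive, which could happen once nearly all mass sits on the boundaries; a production implementation would need such a check.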
The Beta with spikes: loss probability
The Beta with spikes: fixation probability
The Beta with spikes: Drift only
Numerical accuracy: Drift only
[Figure: two panels, "Beta" and "Beta with spikes"]
Inference of divergence times: Drift only
›Simulated data
› 5000 independent loci
› 100 samples in each population
› 50 data sets (replicates)
›Allele frequency distribution is used to calculate likelihood of data
›Likelihood is numerically optimized
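How the allele frequency distribution enters the likelihood is not spelled out in this transcript. A minimal sketch, assuming the data at one locus are the number `k` of copies of the followed allele among `n` sampled gene copies and that sampling is binomial given the population frequency: integrating the binomial over a Beta density gives the beta-binomial distribution, and the spikes add mass to the counts 0 and n. The function and parameter names are hypothetical, and the numerical optimization step is not shown.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def sample_count_probability(k, n, alpha, beta, p_loss=0.0, p_fix=0.0):
    """P(k copies among n sampled) when the population frequency is a mixture of
    a spike at 0 (p_loss), a spike at 1 (p_fix) and Beta(alpha, beta) in between."""
    log_bb = (math.log(math.comb(n, k))
              + log_beta(k + alpha, n - k + beta)
              - log_beta(alpha, beta))
    prob = (1 - p_loss - p_fix) * math.exp(log_bb)
    if k == 0:
        prob += p_loss  # a lost allele is never sampled
    if k == n:
        prob += p_fix  # a fixed allele is always sampled
    return prob


if __name__ == "__main__":
    probs = [sample_count_probability(k, 20, 2.0, 3.0, 0.1, 0.05) for k in range(21)]
    print(sum(probs))
```

With independent loci, the log-likelihood of a data set would be the sum of the logarithms of these probabilities over loci.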
Conclusions
›Beta with spikes
› An extension built on the beta approximation
› Improves the quality of the approximation
› Simple mathematical formulation
› Works under linear evolutionary forces
› Comparable to state-of-the-art methods for inference of divergence times
› Recursive formulation enables incorporation of variable population size
Future work
›Incorporate selection
› Non-linear evolutionary force
› Positive selection increases probability of fixation
› Mean and variance are no longer available in closed form
› Extend the approximation for loss/fixation probabilities to mean and variance