Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approaches.
1. Allele frequencies as Stochastic Processes
Mathematical and Statistical Approaches
Gota Morota
Nov 30, 2010
1 / 32
2. Outline
Change of Allele Frequencies as Stochastic Processes
Steady State Distributions of Allele Frequencies
Time Series Analysis
5. Various factors affecting allele frequencies
• Selection, mutation and migration (cross breedings) ⇒
systematic pressures (Wright 1949)
• Random fluctuations
1. Random sampling of gametes (genetic drift)
2. Random fluctuation in systematic pressures
⇓
Allele frequencies are functions of the systematic forces and the
random components
6. Random walk ⇒ Brownian Motion
[Four time plots of a random walk over increasingly long time ranges]
Figure 1: Time = [1:10]
Figure 2: Time = [1:100]
Figure 3: Time = [1:1000]
Figure 4: Time = [1:10000]
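As a rough illustration of the limit suggested by Figures 1 to 4, the sketch below rescales a simple random walk so that its value at time t has variance close to t, as a Brownian motion would. The function name `scaled_random_walk`, the choice of ±1 steps, the path counts, and the seed are assumptions made for this example and do not come from the slides.

```python
import numpy as np

def scaled_random_walk(n_paths, n_steps, t_max=1.0, seed=0):
    """Simulate random walks on [0, t_max] with steps of size sqrt(dt).

    Each step is +sqrt(dt) or -sqrt(dt) with equal probability, so every
    step has mean 0 and variance dt (an assumed, illustrative choice).
    """
    rng = np.random.default_rng(seed)
    dt = t_max / n_steps
    steps = rng.choice([-1.0, 1.0], size=(n_paths, n_steps)) * np.sqrt(dt)
    # Prepend the starting value 0 and accumulate the steps along time.
    start = np.zeros((n_paths, 1))
    return np.concatenate([start, np.cumsum(steps, axis=1)], axis=1)

paths = scaled_random_walk(n_paths=4000, n_steps=500, t_max=1.0)
end_mean = paths[:, -1].mean()
end_var = paths[:, -1].var()
print(f"mean at t=1: {end_mean:.3f}, variance at t=1: {end_var:.3f}")
```

With more steps per unit of time the paths look increasingly like the long-horizon plots, while the mean and variance at t = 1 should stay near 0 and 1.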
7. Brownian Motion ⇒ Diffusion Model
[Time plot of a random walk, annotated: + conditional on systematic forces]
Figure 5: Time = [1:10000]
• treat the change of allele frequencies as a stochastic process
⇓
Diffusion Model
8. Diffusion Model
It frames the infinite number of paths that allele frequencies could take
over time under certain systematic pressures.
[Three time plots of Allele Frequency against Time, from 0 to 10000]
• pick a single time point t (say 5000 in the plots above)
• try to find the PDF at point t
• need to solve a partial differential equation (PDE)
• Fokker-Planck Equation!
9. Fokker-Planck Equation
• Derived from a continuous time stochastic process (X)
• Partial differential equation
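$$\frac{\partial \phi(p, x; t)}{\partial t} = \frac{1}{2}\frac{\partial^2}{\partial x^2}\left\{V_{\delta x}\,\phi(p, x; t)\right\} - \frac{\partial}{\partial x}\left\{M_{\delta x}\,\phi(p, x; t)\right\} \tag{1}$$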
where
• p: initial allele frequency (fixed)
• x: allele frequency (random variable)
• t: time (continuous variable)
• φ(p , x ; t ): PDF
• Vδx: variance of δx (amount of change in allele frequency per unit time)
• Mδx: mean of δx (amount of change in allele frequency per unit time)
• Vδx and Mδx: both may depend on x and t
10. Fokker-Planck Equation for Brownian Motion
A standard Brownian motion can be constructed from a random walk whose
errors have mean 0 and variance 1, under the right scaling. Its PDF at
time t is N(0, t).
• when t = 1.0, N(0, 1)
• when t = 1.5, N(0, 1.5)
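Fokker-Planck equation, with Mδx = 0 and Vδx = 1 in equation (1):
\begin{align}
\frac{\partial \phi(p, x; t)}{\partial t} &= \frac{1}{2}\frac{\partial^2}{\partial x^2}\phi(p, x; t) \tag{2} \\
&= \text{Heat equation} \tag{3}
\end{align}
Solution:
$$\phi(p, x; t) = \frac{1}{\sqrt{2\pi t}}\exp\left(-\frac{x^2}{2t}\right) \tag{4}$$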
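A quick numerical sanity check of the claim that solution (4) satisfies equation (2) can be sketched with finite differences. The grid, the step sizes, and the name `heat_kernel` are assumptions chosen for illustration.

```python
import numpy as np

def heat_kernel(x, t):
    """Density of N(0, t) evaluated at x, as in equation (4)."""
    return np.exp(-x**2 / (2.0 * t)) / np.sqrt(2.0 * np.pi * t)

x = np.linspace(-2.0, 2.0, 401)
t = 1.0
h = x[1] - x[0]   # spatial step for the finite difference
k = 1e-4          # time step for the finite difference

# Central differences approximating both sides of equation (2).
dphi_dt = (heat_kernel(x, t + k) - heat_kernel(x, t - k)) / (2.0 * k)
d2phi_dx2 = (heat_kernel(x + h, t) - 2.0 * heat_kernel(x, t)
             + heat_kernel(x - h, t)) / h**2

residual = np.max(np.abs(dphi_dt - 0.5 * d2phi_dx2))
print(f"largest residual of equation (2): {residual:.2e}")
```

The residual is not exactly zero because both derivatives are approximated, but it should shrink as the steps h and k are reduced.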
11. Solution of the Heat Equation (the Heat Kernel)
[Plot of the heat kernel against x, from −2 to 2, for t = 0.00001, 0.01, 0.1, 1 and 10]
12. Under Random Genetic Drift
$$M_{\delta x} = 0, \qquad V_{\delta x} = \frac{x(1 - x)}{2N_e}$$
Fokker-Planck equation for random genetic drift:
$$\frac{\partial \phi(p, x; t)}{\partial t} = \frac{1}{4N_e}\frac{\partial^2}{\partial x^2}\left\{x(1 - x)\,\phi(p, x; t)\right\} \tag{5}$$
Solutions are obtained as infinite series by
• Kimura (1955): hypergeometric function
• Korn and Korn (1968): Gegenbauer polynomial
$$\phi = 6p(1 - p)\exp\left(-\frac{1}{2N_e}t\right) + 30p(1 - p)(1 - 2p)(1 - 2x)\exp\left(-\frac{3}{2N_e}t\right) + \cdots$$
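The drift moments Mδx = 0 and Vδx = x(1 − x)/(2Ne) can be checked with a small simulation. This is a minimal sketch that assumes one generation of drift is a binomial draw of 2Ne gametes (a Wright-Fisher style model); the parameter values and the seed are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
Ne = 100          # effective population size (illustrative value)
x0 = 0.3          # current allele frequency (illustrative value)
n_reps = 200_000  # number of independent replicate generations

# One generation of drift: draw 2*Ne gametes, each carrying the allele
# with probability x0, and convert the count back to a frequency.
x1 = rng.binomial(2 * Ne, x0, size=n_reps) / (2 * Ne)
delta = x1 - x0

mean_change = delta.mean()
var_change = delta.var()
expected_var = x0 * (1 - x0) / (2 * Ne)
print(f"mean of delta x: {mean_change:.6f} (expected 0)")
print(f"variance of delta x: {var_change:.6f} (expected {expected_var:.6f})")
```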
13. Solution of FPE (Kimura 1955)
[Figure reproduced from Motoo Kimura (1955), Vol. 41, p. 149]
FIGS. 1-2. The processes of the change in the probability distribution of heterallelic classes,
due to random sampling of gametes in reproduction. It is assumed that the population starts
from the gene frequency 0.5 in Fig. 1 (left) and 0.1 in Fig. 2 (right). t = time in generation;
N = effective size of the population; abscissa is gene frequency; ordinate is probability density.
14. Under Selection and Random Genetic Drift
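$$M_{\delta x} = s\,x(1 - x), \qquad V_{\delta x} = \frac{x(1 - x)}{2N_e}$$
$$\frac{\partial \phi(p, x; t)}{\partial t} = \frac{1}{4N_e}\frac{\partial^2}{\partial x^2}\left\{x(1 - x)\,\phi(p, x; t)\right\} - s\frac{\partial}{\partial x}\left\{x(1 - x)\,\phi(p, x; t)\right\} \tag{6}$$
Solutions are obtained as infinite series from the oblate spheroidal
equation, using a transformation of allele frequencies (z = 1 − 2x)
• Kimura (1955)
• Kimura and Crow (1956)
$$\phi(p, x; t) = \sum_{k=0}^{\infty} C_k \exp(-\lambda_k t + 2cx)\,V_{1k}^{(1)}(z) \tag{7}$$
where
$$V_{1k}^{(1)}(z) = \sum_{n=0,1} f_n^{k}\,T_n^{1}(z)$$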
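To make the moments behind equation (6) concrete, the sketch below simulates one generation of selection followed by drift and compares the observed mean and variance of the change with s x(1 − x) and x(1 − x)/(2Ne). The fitness scheme (allele A with relative fitness 1 + s, allele a with fitness 1), the binomial sampling step, and all numerical values are assumptions for illustration; the match is only approximate for small s.

```python
import numpy as np

rng = np.random.default_rng(2)
Ne, s, x0 = 100, 0.01, 0.3   # illustrative values, not from the slides
n_reps = 500_000

# Assumed genic selection: A has relative fitness 1 + s, a has fitness 1.
x_sel = x0 * (1 + s) / (1 + s * x0)

# Drift: binomial sampling of 2*Ne gametes at the post-selection frequency.
x1 = rng.binomial(2 * Ne, x_sel, size=n_reps) / (2 * Ne)
delta = x1 - x0

mean_change = delta.mean()
var_change = delta.var()
expected_mean = s * x0 * (1 - x0)
expected_var = x0 * (1 - x0) / (2 * Ne)
print(f"mean of delta x: {mean_change:.6f} (approx. {expected_mean:.6f})")
print(f"variance of delta x: {var_change:.6f} (approx. {expected_var:.6f})")
```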
15. Kolmogorov Backward Equation
• Derived from a continuous time stochastic process (P)
• Partial differential equation
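$$\frac{\partial \phi(p, x; t)}{\partial t} = \frac{1}{2}V_{\delta p}\frac{\partial^2}{\partial p^2}\phi(p, x; t) + M_{\delta p}\frac{\partial}{\partial p}\phi(p, x; t) \tag{8}$$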
where
• p: initial allele frequency (random variable)
• x: allele frequency (random variable, except that x at time t is fixed)
• t: time (continuous variable)
• φ(p , x ; t ): PDF
• Vδp : variance of δp (amount of change in allele frequency)
• Mδp : mean of δp (amount of change in allele frequency)
• Vδp and Mδp: both may depend on p but not on t (time homogeneous)
16. Steady State Distribution of Allele Frequencies
Equilibrium
• single point (balance between various forces that keep allele
frequencies near equilibrium)
• PDF
⇓
PDF of stable equilibrium instead of single point
Steady state allele frequency distribution
• Fisher (1922), (1930)
• Wright (1931), (1937), (1938)
$$\phi(p, x; t) = \text{solution of a Fokker-Planck equation} \tag{9}$$
$$\lim_{t \to \infty} \phi(p, x; t) = \phi(x) \tag{10}$$
$$\phi(x) = \frac{C}{V_{\delta x}}\exp\left(2\int \frac{M_{\delta x}}{V_{\delta x}}\,dx\right) \tag{11}$$
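One way to see where equation (11) comes from is to set the time derivative in equation (1) to zero and integrate. The sketch below assumes that the probability flux vanishes at equilibrium, a condition that is not stated on the slide.

```latex
\begin{align*}
0 &= \frac{1}{2}\frac{d^2}{dx^2}\left\{V_{\delta x}\,\phi(x)\right\}
     - \frac{d}{dx}\left\{M_{\delta x}\,\phi(x)\right\} \\
\frac{1}{2}\frac{d}{dx}\left\{V_{\delta x}\,\phi(x)\right\}
  &= M_{\delta x}\,\phi(x)
  \qquad \text{(integrating once, zero flux assumed)} \\
\frac{d}{dx}\ln\left\{V_{\delta x}\,\phi(x)\right\}
  &= \frac{2M_{\delta x}}{V_{\delta x}} \\
\phi(x) &= \frac{C}{V_{\delta x}}
  \exp\left(2\int \frac{M_{\delta x}}{V_{\delta x}}\,dx\right)
\end{align*}
```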
17. Steady State Distribution – Random Genetic Drift
For a large value of t, only the first few terms have impact on
determining the actual form of the PDF.
$$\phi = 6p(1 - p)\exp\left(-\frac{t}{2N_e}\right) + 30p(1 - p)(1 - 2p)(1 - 2x)\exp\left(-\frac{3t}{2N_e}\right) + \cdots$$
Asymptotic formula:
$$\lim_{t \to \infty} \phi = C \cdot \exp\left(-\frac{1}{2N_e}t\right)$$
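The decay rate in the asymptotic formula can be compared with a discrete model. The sketch below builds the transition matrix of a Wright-Fisher population (an assumed model, with N taken equal to Ne) and looks at the leading eigenvalue among the unfixed classes, which should be 1 − 1/(2N) per generation, close to exp(−1/(2N)). The helper name `wright_fisher_matrix` and the value N = 20 are illustrative.

```python
import math
import numpy as np

def wright_fisher_matrix(n_gametes):
    """Transition matrix P[i, j] = Pr(j copies next generation | i copies now)."""
    P = np.zeros((n_gametes + 1, n_gametes + 1))
    for i in range(n_gametes + 1):
        x = i / n_gametes
        for j in range(n_gametes + 1):
            P[i, j] = math.comb(n_gametes, j) * x**j * (1 - x)**(n_gametes - j)
    return P

N = 20                          # illustrative population size
P = wright_fisher_matrix(2 * N)
Q = P[1:-1, 1:-1]               # transitions among unfixed classes only
leading = np.max(np.linalg.eigvals(Q).real)

print(f"leading eigenvalue: {leading:.6f}")
print(f"1 - 1/(2N):         {1 - 1 / (2 * N):.6f}")
print(f"exp(-1/(2N)):       {math.exp(-1 / (2 * N)):.6f}")
```

After many generations the probability mass in the unfixed classes shrinks by roughly this factor each generation, which is the discrete counterpart of the exp(−t/(2Ne)) term.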
18. Graphical Representation (Wright 1931)
[Excerpt from Wright (1931)] ... is large can be found directly from the Poisson series according to which
the chance of drawing 0 where m is the mean number in a sample is $e^{-m}$.
The contribution to the 0 class will thus be
$$(e^{-1} + e^{-2} + e^{-3} \ldots)f = \frac{e^{-1}}{1 - e^{-1}}f = 0.582f.$$
[Plot against Factor Frequency, with axis ticks at 25%, 50% and 75%]
FIGURE 3. Distribution of gene frequencies in an isolated population in which fixation and
loss of genes each is proceeding at the rate 1/4N in the absence of appreciable selection or muta-
19. Steady State Distribution – Selection and Mutation
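$$M_{\delta x} = -ux + v(1 - x) + \frac{x(1 - x)}{2}\frac{d\bar{a}}{dx}, \qquad V_{\delta x} = \frac{x(1 - x)}{2N_e}$$
$$\phi(x) = C \cdot \exp(2N_e\bar{a})\,x^{4N_e v - 1}(1 - x)^{4N_e u - 1} \tag{12}$$
When A has a selective advantage s over a:
\begin{align*}
\bar{a} &= 2s \cdot x^2 + s \cdot 2x(1 - x) + 0 \cdot (1 - x)^2 \\
&= 2sx
\end{align*}
$$\phi(x) = C \cdot \exp(4N_e s x)\,x^{4N_e v - 1}(1 - x)^{4N_e u - 1} \tag{13}$$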
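As a hedged cross-check, the general steady-state formula (11) can be evaluated numerically with the Mδx and Vδx given above and compared with the closed form (13). Both are known only up to the constant C, so the sketch compares their ratio across x. The parameter values and the grid are illustrative assumptions.

```python
import numpy as np

Ne, u, v, s = 100, 0.01, 0.01, 0.005   # illustrative values
x = np.linspace(0.05, 0.95, 20001)

# Ingredients of equation (11); with a-bar = 2 s x the selection
# term (x(1-x)/2) * d(a-bar)/dx equals s x (1 - x).
M = -u * x + v * (1 - x) + s * x * (1 - x)
V = x * (1 - x) / (2 * Ne)

# Cumulative trapezoidal integral of M / V, starting from x[0].
ratio = M / V
increments = 0.5 * (ratio[1:] + ratio[:-1]) * np.diff(x)
integral = np.concatenate([[0.0], np.cumsum(increments)])
phi_numeric = np.exp(2 * integral) / V

# Closed form (13) without the constant C.
phi_closed = np.exp(4 * Ne * s * x) * x**(4 * Ne * v - 1) * (1 - x)**(4 * Ne * u - 1)

# Both densities are unnormalised, so only their ratio is meaningful.
scale = phi_numeric / phi_closed
max_dev = np.max(np.abs(scale / scale[0] - 1))
print(f"largest relative deviation from a constant ratio: {max_dev:.2e}")
```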
21. Time Series Analysis
When a variable is measured sequentially in time, the resulting data form
a time series.
• Diffusion Model – Continuous time stochastic process
• Time Series – Discrete time stochastic process
22. Basic Models
Observations close together in time tend to be correlated
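• Autoregressive Model: AR(p)
$$X_t = c + \sum_{i=1}^{p} \psi_i X_{t-i} + \varepsilon_t \tag{14}$$
• Moving Average Model: MA(q)
$$X_t = c + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i} + \varepsilon_t \tag{15}$$
• Autoregressive Moving Average Model: ARMA(p, q)
$$X_t = \mathrm{AR}(p) + \mathrm{MA}(q) \tag{16}$$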
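A minimal sketch of the AR model in equation (14) with p = 1: simulate a series and check that neighbouring observations are correlated roughly as the coefficient suggests. The function name `simulate_ar1`, the Gaussian noise, the coefficient 0.6, and the seed are assumptions for this example.

```python
import numpy as np

def simulate_ar1(psi, n, c=0.0, seed=0):
    """Simulate X_t = c + psi * X_{t-1} + eps_t with standard normal noise."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(size=n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = c + psi * x[t - 1] + eps[t]
    return x

x = simulate_ar1(psi=0.6, n=20_000)

# Sample lag-1 autocorrelation of the simulated series.
centered = x - x.mean()
lag1 = np.sum(centered[1:] * centered[:-1]) / np.sum(centered**2)
print(f"lag-1 autocorrelation: {lag1:.3f} (coefficient used: 0.6)")
```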
23. Time Series as a Polynomial Equation
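$$B^k X_t = X_{t-k} \quad \text{(back shift operator)}$$
• AR(p)
\begin{align*}
X_t &= \psi_1 X_{t-1} + \cdots + \psi_p X_{t-p} \\
X_t &= (\psi_1 B + \cdots + \psi_p B^p) X_t \\
(1 - \psi_1 B - \cdots - \psi_p B^p) X_t &= 0
\end{align*}
• ARMA(p,q)
\begin{align*}
X_t - \psi_1 X_{t-1} - \cdots - \psi_p X_{t-p} &= \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q} \\
(1 - \psi_1 B - \cdots - \psi_p B^p) X_t &= (1 + \theta_1 B + \cdots + \theta_q B^q)\,\varepsilon_t
\end{align*}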
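Writing the model as a polynomial in B makes it possible to inspect the roots of the AR polynomial. The sketch below computes them with `numpy.roots`; the helper name `ar_polynomial_roots` is hypothetical, and the rule of thumb in the comments (roots outside the unit circle indicate a stationary AR process) is a standard time-series result rather than something stated on the slides. The coefficient 0.635 is the one reported for the fitted model later in the deck.

```python
import numpy as np

def ar_polynomial_roots(psi):
    """Roots of 1 - psi_1 z - ... - psi_p z^p, given psi = [psi_1, ..., psi_p]."""
    # np.roots expects coefficients from the highest power down to the constant.
    coeffs = [-c for c in reversed(psi)] + [1.0]
    return np.roots(coeffs)

# AR(1) with coefficient 0.635: the polynomial is 1 - 0.635 z.
roots = ar_polynomial_roots([0.635])
print("roots:", roots, "moduli:", np.abs(roots))

# A random walk has coefficient 1, so its root lies on the unit circle.
walk_roots = ar_polynomial_roots([1.0])
print("random walk root modulus:", np.abs(walk_roots))
```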
24. Stationary Process
The mean and variance do not change over time. No trend.
[Two time plots over Time = 1 to 10000]
Figure 6: Random Walk (not stationary)
Figure 7: Detrended (looks stationary)
Detrending:
• linear regression
• take a difference
• Autoregressive Integrated Moving Average: ARIMA(p,d,q)
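Detrending by taking a difference can be sketched as follows: differencing a simulated random walk recovers its independent steps, which is the "I" (d = 1) part of an ARIMA(p,1,q) model. The Gaussian steps, the series length, and the seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
steps = rng.normal(size=10_000)   # independent steps with mean 0, variance 1
walk = np.cumsum(steps)           # random walk: not stationary
diffed = np.diff(walk)            # first difference X_t - X_{t-1}

half = len(walk) // 2
print(f"walk variance, first half:  {walk[:half].var():.2f}")
print(f"walk variance, second half: {walk[half:].var():.2f}")
print(f"differenced variance, first half:  {diffed[:half].var():.2f}")
print(f"differenced variance, second half: {diffed[half:].var():.2f}")
```

The variance of the raw walk typically differs between the two halves, while the differenced series should have a variance near 1 in both.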
25. Application on Allele Frequencies
• Influential SNPs – indicative of deterministic trends
• Uninfluential SNPs – random fluctuation?
• Diffusion Model – assumed Markovian process
• Time Series – which model describes the process of change
of allele frequencies
Application
• Objective: model the process of change of allele frequencies
• Data: SNP genotypes of 4,798 Holstein bulls with 38,416
markers and milk yield
• Genotype imputation: fastPHASE 1.4
• Estimation of marker effects: BayesCπ
26. BayesCπ
[Title page of the paper]
Analysis of human mini-exome sequencing data using a Bayesian hierarchical mixture
model: Genetic Analysis Workshop 17
Bueno Filho JS (1,2,*), Morota G (1,*), Tran QT (3), Maenner MJ (4), Vera-Cala LM (4,5), Engelman CD (4,§), and Meyers KJ (4,§)
1 Department of Dairy Science, University of Wisconsin-Madison, USA
2 Departamento de Ciências Exatas, Universidade Federal de Lavras, Brasil
3 Department of Statistics, University of Wisconsin-Madison, USA
4 Department of Population Health Sciences, University of Wisconsin-Madison, USA
5 Departamento de Salud Publica, Universidad Industrial de Santander, Colombia
* Contributed equally to this work
§ Corresponding author
Figure 8: GAW17
27. Allele Frequency of the Top Marker
[Two time plots of Allele Frequency against Time: Original (top) and Detrended (bottom)]
Figure 9: Time plots of allele frequencies. Top: Original series. Bottom:
Smoothed by taking the first order difference.
28. Autocorrelation and Partial Autocorrelation
ARIMA(1,1,1)?
[Four panels plotted against Lag (0 to 14): ACF and Partial ACF of the original series, and ACF and Partial ACF of the first order difference series]
Figure 10: ACF and PACF
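The ACF panels in Figure 10 plot the sample autocorrelation at each lag. A minimal sketch of that calculation is given below; the function name `sample_acf` and the alternating toy series are assumptions for illustration, and the partial ACF is not covered here.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation of x at lags 0, 1, ..., max_lag."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered**2)
    n = len(x)
    return np.array([
        np.sum(centered[k:] * centered[:n - k]) / denom
        for k in range(max_lag + 1)
    ])

# Toy series that alternates in sign, so neighbouring values are
# strongly negatively correlated.
series = np.array([1.0, -1.0] * 50)
acf = sample_acf(series, max_lag=3)
print("ACF at lags 0 to 3:", acf)
```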
29. Model Selection
Table 1: Comparison of several competitive models
Model           AIC
ARIMA(1,0,0)    -51.56
ARIMA(0,1,0)    -49.38
ARIMA(0,0,1)    -46.41
ARIMA(1,1,0)    -52.47
ARIMA(1,0,1)    -51.13
ARIMA(1,1,1)    -51.02
ARIMA(1,1,0)
$$X_t = 0.635X_{t-1} + \varepsilon_t$$
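The selection step can be reproduced from the AIC values in Table 1: the model with the smallest AIC is preferred. The sketch below only re-reads the table and does not refit any model; the dictionary name `aic` is an assumption.

```python
# AIC values copied from Table 1.
aic = {
    "ARIMA(1,0,0)": -51.56,
    "ARIMA(0,1,0)": -49.38,
    "ARIMA(0,0,1)": -46.41,
    "ARIMA(1,1,0)": -52.47,
    "ARIMA(1,0,1)": -51.13,
    "ARIMA(1,1,1)": -51.02,
}

# Lower AIC indicates a better trade-off between fit and complexity.
best = min(aic, key=aic.get)
print(f"model with the smallest AIC: {best} (AIC = {aic[best]})")
```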