Allele frequencies as Stochastic Processes
Mathematical and Statistical Approaches

Gota Morota

Nov 30, 2010


Presented at Animal Breeding & Genomics Seminar. University of Wisconsin-Madison.


  1. 1. Allele frequencies as Stochastic Processes: Mathematical and Statistical Approaches. Gota Morota, Nov 30, 2010
  2. 2. Outline
      • Change of Allele Frequencies as Stochastic Processes
      • Steady State Distributions of Allele Frequencies
      • Time Series Analysis
  5. 5. Various factors affecting allele frequencies
      • Selection, mutation and migration (crossbreeding) ⇒ systematic pressures (Wright 1949)
      • Random fluctuations
        1. Random sampling of gametes (genetic drift)
        2. Random fluctuation in systematic pressures
      ⇓
      Allele frequencies are functions of the systematic forces and the random components
  6. 6. Random walk ⇒ Brownian Motion
      [Figures 1-4: sample paths of a random walk plotted against Time. Figure 1: Time = [1:10]. Figure 2: Time = [1:100]. Figure 3: Time = [1:1000]. Figure 4: Time = [1:10000].]
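
The passage from a random walk to Brownian motion can be sketched numerically. The snippet below is a minimal illustration and is not taken from the slides: it assumes steps of +√Δt or −√Δt with equal probability, and the function name `scaled_random_walk`, the seed, and the sample sizes are arbitrary choices. Under that scaling the endpoint at time t should have mean close to 0 and variance close to t.

```python
import math
import random
import statistics


def scaled_random_walk(n_steps, t_end, rng):
    """Return the endpoint at time t_end of a walk with n_steps steps of size sqrt(dt)."""
    dt = t_end / n_steps
    step = math.sqrt(dt)  # scaling that keeps the variance equal to elapsed time
    position = 0.0
    for _ in range(n_steps):
        position += step if rng.random() < 0.5 else -step
    return position


rng = random.Random(2010)  # fixed seed so the run is reproducible
t_end = 1.5
endpoints = [scaled_random_walk(200, t_end, rng) for _ in range(4000)]

sample_mean = statistics.mean(endpoints)
sample_var = statistics.pvariance(endpoints)
print(f"mean = {sample_mean:.3f}, variance = {sample_var:.3f} (theory: 0 and {t_end})")
```

Increasing the number of steps refines the path without changing the variance at a given time, which is the sense in which the finer and finer walks in the figures approach one limiting process.
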
  7. 7. Brownian Motion ⇒ Diffusion Model
      [Figure 5: Time = [1:10000]. Annotation: Brownian motion + conditional on systematic forces.]
      • treat the change of allele frequencies as a stochastic process
      ⇓
      Diffusion Model
  8. 8. Diffusion Model
      It frames the infinite number of paths that allele frequencies could take over time under certain systematic pressures.
      [Figure: three panels of Allele Frequency plotted against Time, 0 to 10000.]
      • pick a single time point t (say 5000 in the figure above)
      • try to find the PDF at point t
      • this requires solving a partial differential equation (PDE)
      • Fokker-Planck Equation!
  9. 9. Fokker-Planck Equation
      • Derived from a continuous time stochastic process (X)
      • Partial differential equation
      $$\frac{\partial \phi(p, x; t)}{\partial t} = \frac{1}{2}\frac{\partial^2}{\partial x^2}\left\{V_{\delta x}\,\phi(p, x; t)\right\} - \frac{\partial}{\partial x}\left\{M_{\delta x}\,\phi(p, x; t)\right\} \qquad (1)$$
      where
      • p: initial allele frequency (fixed)
      • x: allele frequency (random variable)
      • t: time (continuous variable)
      • φ(p, x; t): PDF
      • Vδx: variance of δx (amount of change in allele frequency per time)
      • Mδx: mean of δx (amount of change in allele frequency per time)
      • Vδx and Mδx: both may depend on x and t
  10. 10. Fokker-Planck Equation for Brownian Motion
      A standard Brownian motion can be constructed from a random walk whose errors have mean 0 and variance 1, under the right scaling. It has the PDF of N(0, t).
      • when t = 1.0, N(0, 1)
      • when t = 1.5, N(0, 1.5)
      Fokker-Planck equation, with Mδx = 0 and Vδx = 1 in equation (1):
      $$\frac{\partial \phi(p, x; t)}{\partial t} = \frac{1}{2}\frac{\partial^2}{\partial x^2}\phi(p, x; t) \qquad (2)$$
      = Heat equation (3)
      Solution:
      $$\phi(p, x; t) = \frac{1}{\sqrt{2\pi t}}\exp\left(\frac{-x^2}{2t}\right) \qquad (4)$$
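
As a quick sanity check that the solution (4) satisfies the heat equation (2), the sketch below compares both sides of the equation by central finite differences. It is only an illustration: the helper names `heat_kernel` and `pde_residual`, the step size h, and the test points are assumptions made here, not part of the slides.

```python
import math


def heat_kernel(x, t):
    """Density of N(0, t) at x, i.e. equation (4)."""
    return math.exp(-x * x / (2.0 * t)) / math.sqrt(2.0 * math.pi * t)


def pde_residual(x, t, h=1e-3):
    """Finite-difference estimate of d(phi)/dt - (1/2) d2(phi)/dx2 at (x, t)."""
    dphi_dt = (heat_kernel(x, t + h) - heat_kernel(x, t - h)) / (2.0 * h)
    d2phi_dx2 = (
        heat_kernel(x + h, t) - 2.0 * heat_kernel(x, t) + heat_kernel(x - h, t)
    ) / (h * h)
    return dphi_dt - 0.5 * d2phi_dx2


# Evaluate the residual on a small grid of (t, x) points; it should be near zero.
residuals = [
    abs(pde_residual(x, t)) for t in (0.5, 1.0, 1.5) for x in (-1.0, 0.0, 0.7, 2.0)
]
max_residual = max(residuals)
print(f"largest |residual| = {max_residual:.2e}")
```

The residual is not exactly zero because the derivatives are approximated numerically; it shrinks as h is reduced, up to rounding error.
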
  11. 11. Solution of the Heat Equation (the Heat Kernel)
      [Figure: the heat kernel plotted against x from −2 to 2 for t = 0.00001, t = 0.01, t = 0.1, t = 1 and t = 10.]
  12. 12. Under Random Genetic Drift
      $$M_{\delta x} = 0, \qquad V_{\delta x} = \frac{x(1 - x)}{2N_e}$$
      Fokker-Planck equation for random genetic drift:
      $$\frac{\partial \phi(p, x; t)}{\partial t} = \frac{1}{4N_e}\frac{\partial^2}{\partial x^2}\left\{x(1 - x)\,\phi(p, x; t)\right\} \qquad (5)$$
      Solutions are obtained as infinite series by
      • Kimura (1955): hypergeometric function
      • Korn and Korn (1968): Gegenbauer polynomial
      $$\phi = 6p(1 - p)\exp\left(-\frac{1}{2N_e}t\right) + 30p(1 - p)(1 - 2p)(1 - 2x)\exp\left(-\frac{3}{2N_e}t\right) + \cdots$$
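
The drift coefficients Mδx = 0 and Vδx = x(1 − x)/(2Ne) can be checked against a simple simulation of random sampling of gametes. The sketch below assumes an idealized Wright-Fisher population in which 2Ne gametes are drawn independently each generation; the function name `next_generation`, the seed, and the parameter values are illustrative choices rather than anything specified in the slides.

```python
import random
import statistics


def next_generation(x, n_e, rng):
    """Draw 2 * n_e gametes from a pool with allele frequency x; return the new frequency."""
    copies = sum(1 for _ in range(2 * n_e) if rng.random() < x)
    return copies / (2 * n_e)


rng = random.Random(1955)  # fixed seed for reproducibility
n_e, x0 = 50, 0.3

# One-generation changes in allele frequency across many replicate populations.
changes = [next_generation(x0, n_e, rng) - x0 for _ in range(10000)]

mean_change = statistics.mean(changes)
var_change = statistics.pvariance(changes)
expected_var = x0 * (1 - x0) / (2 * n_e)
print(f"mean change = {mean_change:.5f} (theory 0)")
print(f"variance    = {var_change:.5f} (theory {expected_var:.5f})")
```

The variance shrinks as Ne grows, which is why drift matters most in small populations.
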
  13. 13. Solution of FPE (Kimura 1955)
      [Figure reproduced from Kimura (1955), Vol. 41, p. 149.] Figs. 1-2: The processes of the change in the probability distribution of heterallelic classes, due to random sampling of gametes in reproduction. It is assumed that the population starts from the gene frequency 0.5 in Fig. 1 (left) and 0.1 in Fig. 2 (right). t = time in generation; N = effective size of the population; abscissa is gene frequency; ordinate is probability density.
  14. 14. Under Selection and Random Genetic Drift
      $$M_{\delta x} = s\,x(1 - x), \qquad V_{\delta x} = \frac{x(1 - x)}{2N_e}$$
      $$\frac{\partial \phi(p, x; t)}{\partial t} = \frac{1}{4N_e}\frac{\partial^2}{\partial x^2}\left\{x(1 - x)\,\phi(p, x; t)\right\} - s\,\frac{\partial}{\partial x}\left\{x(1 - x)\,\phi(p, x; t)\right\} \qquad (6)$$
      Solutions are obtained as infinite series from the oblate spheroidal equation, after the transformation of allele frequencies z = 1 − 2x
      • Kimura (1955)
      • Kimura and Crow (1956)
      $$\phi(p, x, t) = \sum_{k=0}^{\infty} C_k \exp(-\lambda_k t + 2cx)\,V_{1k}^{(1)}(z) \qquad (7)$$
      where
      $$V_{1k}^{(1)}(z) = \sum_{n=0,1} f_n^{k}\,T_n^{1}(z)$$
  15. 15. Kolmogorov Backward Equation
      • Derived from a continuous time stochastic process (P)
      • Partial differential equation
      $$\frac{\partial \phi(p, x; t)}{\partial t} = \frac{1}{2}V_{\delta p}\frac{\partial^2}{\partial p^2}\phi(p, x; t) + M_{\delta p}\frac{\partial}{\partial p}\phi(p, x; t) \qquad (8)$$
      where
      • p: initial allele frequency (random variable)
      • x: allele frequency (a random variable, but its value at time t is held fixed)
      • t: time (continuous variable)
      • φ(p, x; t): PDF
      • Vδp: variance of δp (amount of change in allele frequency)
      • Mδp: mean of δp (amount of change in allele frequency)
      • Vδp and Mδp: both may depend on p but not on t (time homogeneous)
  16. 16. Steady State Distribution of Allele Frequencies
      Equilibrium
      • single point (balance between various forces that keep allele frequencies near equilibrium)
      • PDF
      ⇓
      PDF of stable equilibrium instead of a single point
      Steady state allele frequency distribution
      • Fisher (1922), (1930)
      • Wright (1931), (1937), (1938)
      $$\phi(p, x; t) = \text{solution of a Fokker-Planck equation} \qquad (9)$$
      $$\lim_{t \to \infty} \phi(p, x; t) = \phi(x) \qquad (10)$$
      $$\phi(x) = \frac{C}{V_{\delta x}}\exp\left(2\int \frac{M_{\delta x}}{V_{\delta x}}\,dx\right) \qquad (11)$$
  17. 17. Steady State Distribution – Random Genetic Drift
      For a large value of t, only the first few terms have an impact on determining the actual form of the PDF.
      $$\phi = 6p(1 - p)\exp\left(\frac{-t}{2N_e}\right) + 30p(1 - p)(1 - 2p)(1 - 2x)\exp\left(\frac{-3t}{2N_e}\right) + \cdots$$
      Asymptotic formula:
      $$\lim_{t \to \infty} \phi = C \cdot \exp\left(-\frac{1}{2N_e}t\right)$$
  18. 18. Graphical Representation (Wright 1931)
      [Scanned excerpt from Wright (1931):] "... is large can be found directly from the Poisson series according to which the chance of drawing 0 where m is the mean number in a sample is e^{-m}. The contribution to the 0 class will thus be (e^{-1} + e^{-2} + e^{-3} ...) f = e^{-1} f / (1 − e^{-1}) = 0.582 f."
      [Figure, horizontal axis: Factor Frequency, with ticks at 25%, 50%, 75%.] FIGURE 3: Distribution of gene frequencies in an isolated population in which fixation and loss of genes each is proceeding at the rate 1/4N in the absence of appreciable selection or muta-
  19. 19. Steady State Distribution – Selection and Mutation
      $$M_{\delta x} = -ux + v(1 - x) + \frac{x(1 - x)}{2}\frac{d\bar{a}}{dx}, \qquad V_{\delta x} = \frac{x(1 - x)}{2N_e}$$
      $$\phi(x) = C \cdot \exp(2N_e\bar{a})\,x^{4N_e v - 1}(1 - x)^{4N_e u - 1} \qquad (12)$$
      When A has selective advantage s over a:
      $$\bar{a} = 2s\,x^2 + s \cdot 2x(1 - x) + 0 \cdot (1 - x)^2 = 2sx$$
      $$\phi(x) = C \cdot \exp(4N_e s x)\,x^{4N_e v - 1}(1 - x)^{4N_e u - 1} \qquad (13)$$
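
To get a feel for equation (13), the sketch below evaluates the unnormalised density on a grid and computes the mean allele frequency by the midpoint rule. It is an illustration under assumed parameter values (Ne = 1000, u = 0.00075, v = 0.0005, chosen so that both exponents are non-negative); the function names are hypothetical. With s = 0 the density reduces to a Beta distribution with parameters 4Ne·v and 4Ne·u, whose mean is v/(u + v), which gives a known value to compare against.

```python
import math


def steady_state_density(x, n_e, u, v, s):
    """Unnormalised phi(x) from equation (13); the constant C is handled by normalising."""
    return (
        math.exp(4 * n_e * s * x)
        * x ** (4 * n_e * v - 1)
        * (1 - x) ** (4 * n_e * u - 1)
    )


def mean_frequency(n_e, u, v, s, n_grid=20000):
    """Midpoint-rule estimate of the mean of x under the normalised density."""
    h = 1.0 / n_grid
    xs = [(i + 0.5) * h for i in range(n_grid)]
    weights = [steady_state_density(x, n_e, u, v, s) for x in xs]
    return sum(x * w for x, w in zip(xs, weights)) / sum(weights)


n_e, u, v = 1000, 0.00075, 0.0005  # assumed values, for illustration only
neutral_mean = mean_frequency(n_e, u, v, s=0.0)
selected_mean = mean_frequency(n_e, u, v, s=0.001)
print(f"mean without selection = {neutral_mean:.4f} (Beta mean v/(u+v) = {v / (u + v):.4f})")
print(f"mean with s = 0.001    = {selected_mean:.4f}")
```

A positive s multiplies the density by exp(4Ne·s·x), which puts more weight on high frequencies and so raises the mean relative to the neutral case.
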
  20. 20. Graphical Representation (Wright 1937)
      [Figs. 1, 2, 4, 5 and 6 reproduced from Wright (1937), Proc. N. A. S., Genetics, p. 308.]
  21. 21. Time Series Analysis
      When a variable is measured sequentially in time, the resulting data form a time series.
      • Diffusion Model: continuous time stochastic process
      • Time Series: discrete time stochastic process
  22. 22. Basic Models
      Observations close together in time tend to be correlated
      • Autoregressive Model: AR(p)
      $$X_t = c + \sum_{i=1}^{p} \psi_i X_{t-i} + \epsilon_t \qquad (14)$$
      • Moving Average Model: MA(q)
      $$X_t = c + \sum_{i=1}^{q} \theta_i \epsilon_{t-i} + \epsilon_t \qquad (15)$$
      • Autoregressive Moving Average Model: ARMA(p, q)
      $$X_t = \mathrm{AR}(p) + \mathrm{MA}(q) \qquad (16)$$
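
A small simulation makes the AR model in equation (14) concrete. The sketch below generates an AR(1) series with Gaussian errors and recovers ψ1 by regressing X_t on X_{t−1}. The coefficient 0.6, the seed, the series length, and the function names are all assumptions made for illustration; the least-squares estimator is one simple choice and not necessarily the fitting method used for the results in these slides.

```python
import random


def simulate_ar1(psi, c, n, rng, sigma=1.0):
    """Simulate X_t = c + psi * X_{t-1} + eps_t with eps_t ~ N(0, sigma^2)."""
    x = c / (1.0 - psi)  # start at the stationary mean
    series = []
    for _ in range(n):
        x = c + psi * x + rng.gauss(0.0, sigma)
        series.append(x)
    return series


def estimate_ar1(series):
    """Least-squares slope of X_t on X_{t-1}."""
    prev, curr = series[:-1], series[1:]
    mean_prev = sum(prev) / len(prev)
    mean_curr = sum(curr) / len(curr)
    cov = sum((a - mean_prev) * (b - mean_curr) for a, b in zip(prev, curr))
    var = sum((a - mean_prev) ** 2 for a in prev)
    return cov / var


rng = random.Random(14)  # fixed seed for reproducibility
true_psi = 0.6
series = simulate_ar1(psi=true_psi, c=0.0, n=20000, rng=rng)
psi_hat = estimate_ar1(series)
print(f"true psi = {true_psi}, estimated psi = {psi_hat:.3f}")
```

With a short series, such as the roughly 30 time points available for a single marker, the estimate would be far noisier than in this long simulated run.
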
  23. 23. Time Series as a Polynomial Equation
      $$B^k X_t = X_{t-k} \quad \text{(back shift operator)}$$
      • AR(p)
      $$X_t = \psi_1 X_{t-1} + \cdots + \psi_p X_{t-p}$$
      $$X_t = (\psi_1 B + \cdots + \psi_p B^p) X_t$$
      $$(1 - \psi_1 B - \cdots - \psi_p B^p) X_t = 0$$
      • ARMA(p, q)
      $$X_t - \psi_1 X_{t-1} - \cdots - \psi_p X_{t-p} = \epsilon_t + \theta_1 \epsilon_{t-1} + \cdots + \theta_q \epsilon_{t-q}$$
      $$(1 - \psi_1 B - \cdots - \psi_p B^p) X_t = (1 + \theta_1 B + \cdots + \theta_q B^q)\,\epsilon_t$$
  24. 24. Stationary Process
      The mean and variance do not change over time. No trend.
      [Figure 6: Random Walk (not stationary). Figure 7: Detrended (looks stationary). Both plotted against Time, 0 to 10000.]
      Detrending:
      • linear regression
      • take a difference
      • Autoregressive Integrated Moving Average: ARIMA(p, d, q)
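
The "take a difference" step can be illustrated on a simulated random walk. In the sketch below the helper name `difference`, the seed, and the series length are assumptions for illustration. The first difference of a random walk with unit-variance Gaussian steps is just the sequence of steps, so its mean should stay near 0 and its variance near 1 in both halves of the series.

```python
import random
import statistics


def difference(series, d=1):
    """Apply the first-difference operator (1 - B) to the series d times."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series


rng = random.Random(24)  # fixed seed for reproducibility
walk, level = [], 0.0
for _ in range(10000):
    level += rng.gauss(0.0, 1.0)
    walk.append(level)

diffed = difference(walk)
half = len(diffed) // 2
var_first = statistics.pvariance(diffed[:half])
var_second = statistics.pvariance(diffed[half:])
print(f"variance of differences: first half {var_first:.3f}, second half {var_second:.3f}")
print(f"mean of differences: {statistics.mean(diffed):.4f}")
```

The parameter d plays the same role as the middle index of ARIMA(p, d, q): it is the number of times the series is differenced before an ARMA model is fitted.
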
  25. 25. Application on Allele Frequencies
      • Influential SNPs: indicative of deterministic trends
      • Uninfluential SNPs: random fluctuation?
      • Diffusion Model: assumes a Markovian process
      • Time Series: which model describes the process of change of allele frequencies?
      Application
      • Objective: model the process of change of allele frequencies
      • Data: SNP genotypes of 4,798 Holstein bulls with 38,416 markers, and milk yield
      • Genotype imputation: fastPHASE 1.4
      • Estimation of marker effects: BayesCπ
  26. 26. BayesCπ
      [Figure 8: GAW17. Title page of "Analysis of human mini-exome sequencing data using a Bayesian hierarchical mixture model: Genetic Analysis Workshop 17" by Bueno Filho JS (1,2), Morota G (1), Tran QT (3), Maenner MJ (4), Vera-Cala LM (4,5), Engelman CD (4), and Meyers KJ (4). Affiliations: (1) Department of Dairy Science, University of Wisconsin-Madison, USA; (2) Departamento de Ciências Exatas, Universidade Federal de Lavras, Brasil; (3) Department of Statistics, University of Wisconsin-Madison, USA; (4) Department of Population Health Sciences, University of Wisconsin-Madison, USA; (5) Departamento de Salud Publica, Universidad Industrial de Santander, Colombia.]
  27. 27. Allele Frequency of the Top Marker
      [Figure 9: Time plots of allele frequencies (Allele Frequency against Time, 0 to 30). Top: Original series. Bottom: Smoothed by taking the first order difference (panel title: Detrended).]
  28. 28. Autocorrelation and Partial Autocorrelation
      ARIMA(1,1,1)?
      [Figure 10: ACF and PACF against Lag (0 to 14). Top row: original series. Bottom row: first order difference series.]
  29. 29. Model Selection
      Table 1: Comparison of several competitive models

      Model           AIC
      ARIMA(1,0,0)    -51.56
      ARIMA(0,1,0)    -49.38
      ARIMA(0,0,1)    -46.41
      ARIMA(1,1,0)    -52.47
      ARIMA(1,0,1)    -51.13
      ARIMA(1,1,1)    -51.02

      ARIMA(1,1,0):
      $$X_t = 0.635\,X_{t-1} + \epsilon_t$$
  30. 30. Advanced Models
      Time dependent variance
      • ARCH (Autoregressive Conditional Heteroskedasticity)
      • GARCH (Generalized Autoregressive Conditional Heteroskedasticity)
      Multivariate
      • VARMA (Vector Autoregression Moving Average)
      • BVARMA (Bayesian Vector Autoregression Moving Average)
  31. 31. Intersection of Mathematics and Statistics
      Under certain conditions, GARCH(1,1) ≈ Diffusion Model!
  32. 32. Thank you!