- 1. Allele Frequencies as Stochastic Processes: Mathematical and Statistical Approaches. Gota Morota, Nov 30, 2010
- 2. Outline
  - Change of Allele Frequencies as Stochastic Processes
  - Steady State Distributions of Allele Frequencies
  - Time Series Analysis
- 5. Various factors affecting allele frequencies
  - Selection, mutation and migration (cross breeding) ⇒ systematic pressures (Wright 1949)
  - Random fluctuations
    1. Random sampling of gametes (genetic drift)
    2. Random fluctuation in systematic pressures
  - ⇓ Allele frequencies are functions of the systematic forces and the random components
- 6. Random walk ⇒ Brownian Motion
  - [Figures 1 to 4: sample paths plotted against Time. Figure 1: Time = [1,10]. Figure 2: Time = [1:100]. Figure 3: Time = [1:1000]. Figure 4: Time = [1:10000].]
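The passage from a random walk to Brownian motion can be sketched numerically. The Python snippet below is an illustration only, not the code behind the slide's figures: the function names, the ±1 step distribution, and the seeds are assumptions. Each of the `n_steps` steps is scaled by the square root of the time increment, so the endpoint variance should approach `t_end`, which matches the N(0, t) behaviour of standard Brownian motion.

```python
import math
import random


def scaled_random_walk(n_steps, t_end=1.0, seed=0):
    """Path of a +/-1 random walk rescaled to approximate Brownian motion on [0, t_end]."""
    rng = random.Random(seed)
    dt = t_end / n_steps
    scale = math.sqrt(dt)  # each step has variance dt after scaling
    path = [0.0]
    for _ in range(n_steps):
        step = 1.0 if rng.random() < 0.5 else -1.0  # mean 0, variance 1
        path.append(path[-1] + scale * step)
    return path


def endpoint_variance(n_paths, n_steps, t_end=1.0):
    """Sample variance of the walk's endpoint across independent paths."""
    ends = [scaled_random_walk(n_steps, t_end, seed=s)[-1] for s in range(n_paths)]
    mean = sum(ends) / n_paths
    return sum((e - mean) ** 2 for e in ends) / (n_paths - 1)


if __name__ == "__main__":
    # Should be close to t_end = 1.0, up to sampling noise.
    print(endpoint_variance(n_paths=2000, n_steps=200, t_end=1.0))
```

Increasing `n_steps` while holding `t_end` fixed refines the path without changing its overall spread, which is the "right scaling" that the Brownian-motion construction relies on.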
- 7. Brownian Motion ⇒ Diffusion Model
  - Brownian motion + conditional on systematic forces [Figure 5: sample path, Time = [1:10000]]
  - Treat the change of allele frequencies as a stochastic process ⇓ Diffusion Model
- 8. Diffusion Model
  - It frames the infinite number of paths that allele frequencies could take over time under certain systematic pressures.
  - [Figure: three sample paths of Allele Frequency against Time, 0 to 10000]
  - Pick a single time point t (say 5000 in the figure above)
  - Try to find the PDF at point t
  - This requires solving a partial differential equation (PDE)
  - Fokker-Planck Equation!
- 9. Fokker-Planck Equation
  - Derived from a continuous time stochastic process (X)
  - Partial differential equation
    $$\frac{\partial \phi(p, x; t)}{\partial t} = \frac{1}{2}\frac{\partial^2}{\partial x^2}\left\{V_{\delta x}\,\phi(p, x; t)\right\} - \frac{\partial}{\partial x}\left\{M_{\delta x}\,\phi(p, x; t)\right\} \tag{1}$$
  - where
    - $p$: initial allele frequency (fixed)
    - $x$: allele frequency (random variable)
    - $t$: time (continuous variable)
    - $\phi(p, x; t)$: PDF
    - $V_{\delta x}$: variance of $\delta x$ (amount of change in allele frequency per time)
    - $M_{\delta x}$: mean of $\delta x$ (amount of change in allele frequency per time)
    - $V_{\delta x}$ and $M_{\delta x}$: both may depend on $x$ and $t$
- 10. Fokker-Planck Equation for Brownian Motion
  - A standard Brownian motion can be constructed from a random walk whose errors have mean 0 and variance 1, under the right scaling. It has the PDF of $N(0, t)$.
    - when $t = 1.0$, $N(0, 1)$
    - when $t = 1.5$, $N(0, 1.5)$
  - Fokker-Planck equation, with $M_{\delta x} = 0$ and $V_{\delta x} = 1$ in equation (1):
    $$\frac{\partial \phi(p, x; t)}{\partial t} = \frac{1}{2}\frac{\partial^2}{\partial x^2}\phi(p, x; t) \tag{2}$$
    $$= \text{Heat equation} \tag{3}$$
  - Solution:
    $$\phi(p, x; t) = \frac{1}{\sqrt{2\pi t}}\exp\left(-\frac{x^2}{2t}\right) \tag{4}$$
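That the N(0, t) density solves the heat equation can be checked numerically. The sketch below is a hedged illustration: the function names and the finite-difference step `h` are assumptions, not part of the slides. It evaluates the residual ∂φ/∂t − ½ ∂²φ/∂x² with central differences, which should be close to zero at any point with t > 0.

```python
import math


def heat_kernel(x, t):
    """Density of N(0, t) evaluated at x, i.e. the solution in equation (4)."""
    return math.exp(-x * x / (2.0 * t)) / math.sqrt(2.0 * math.pi * t)


def heat_equation_residual(x, t, h=1e-3):
    """Central-difference estimate of d(phi)/dt - 0.5 * d2(phi)/dx2 at (x, t)."""
    dphi_dt = (heat_kernel(x, t + h) - heat_kernel(x, t - h)) / (2.0 * h)
    d2phi_dx2 = (
        heat_kernel(x + h, t) - 2.0 * heat_kernel(x, t) + heat_kernel(x - h, t)
    ) / (h * h)
    return dphi_dt - 0.5 * d2phi_dx2


if __name__ == "__main__":
    for x, t in [(0.5, 1.0), (-1.0, 1.5)]:
        print(x, t, heat_equation_residual(x, t))
```

The residual is not exactly zero because the derivatives are approximated, but it shrinks roughly with the square of `h` until floating-point rounding takes over.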
- 11. Solution of the Heat Equation (the Heat Kernel)
  - [Figure: heat kernel plotted against x from −2 to 2 for t = 0.00001, t = 0.01, t = 0.1, t = 1, and t = 10]
- 12. Under Random Genetic Drift
  - $M_{\delta x} = 0$, $V_{\delta x} = \dfrac{x(1-x)}{2N_e}$
  - Fokker-Planck equation for random genetic drift:
    $$\frac{\partial \phi(p, x; t)}{\partial t} = \frac{1}{4N_e}\frac{\partial^2}{\partial x^2}\left\{x(1-x)\,\phi(p, x; t)\right\} \tag{5}$$
  - Solutions are obtained as infinite series by
    - Kimura (1955): hypergeometric function
    - Korn and Korn (1968): Gegenbauer polynomial
    $$\phi = 6p(1-p)\exp\left(-\frac{t}{2N_e}\right) + 30p(1-p)(1-2p)(1-2x)\exp\left(-\frac{3t}{2N_e}\right) + \cdots$$
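The drift moments used here, a mean change of 0 and a variance of x(1 − x)/(2Ne) per generation, can be illustrated with a small Wright-Fisher style simulation. This is a minimal sketch under stated assumptions: binomial sampling of 2·Ne gametes, an effective size of 50, and hypothetical function names. It is not taken from the slides.

```python
import random


def next_generation(x, n_e, rng):
    """Sample 2*n_e gametes with success probability x; return the new allele frequency."""
    n_gametes = 2 * n_e
    copies = sum(1 for _ in range(n_gametes) if rng.random() < x)
    return copies / n_gametes


def one_generation_moments(x, n_e, n_reps=5000, seed=0):
    """Empirical mean and variance of the one-generation change in allele frequency."""
    rng = random.Random(seed)
    deltas = [next_generation(x, n_e, rng) - x for _ in range(n_reps)]
    mean = sum(deltas) / n_reps
    var = sum((d - mean) ** 2 for d in deltas) / (n_reps - 1)
    return mean, var


if __name__ == "__main__":
    x, n_e = 0.3, 50
    mean, var = one_generation_moments(x, n_e)
    print("empirical mean of delta x:", mean)
    print("empirical variance:", var, "theory:", x * (1 - x) / (2 * n_e))
```

With x = 0.3 and Ne = 50 the theoretical variance is 0.21 / 100 = 0.0021, and the empirical value should land near it up to sampling noise.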
- 13. Solution of FPE (Kimura 1955)
  - [Figure reproduced from Kimura (1955), Vol. 41, p. 149] Figs. 1-2: The processes of the change in the probability distribution of heterallelic classes, due to random sampling of gametes in reproduction. It is assumed that the population starts from the gene frequency 0.5 in Fig. 1 (left) and 0.1 in Fig. 2 (right). t = time in generations; N = effective size of the population; abscissa is gene frequency; ordinate is probability density.
- 14. Under Selection and Random Genetic Drift
  - $M_{\delta x} = s\,x(1-x)$, $V_{\delta x} = \dfrac{x(1-x)}{2N_e}$
    $$\frac{\partial \phi(p, x; t)}{\partial t} = \frac{1}{4N_e}\frac{\partial^2}{\partial x^2}\left\{x(1-x)\,\phi(p, x; t)\right\} - s\,\frac{\partial}{\partial x}\left\{x(1-x)\,\phi(p, x; t)\right\} \tag{6}$$
  - Solutions are obtained as an infinite series via the oblate spheroidal equation, after transforming allele frequencies ($z = 1 - 2x$)
    - Kimura (1955)
    - Kimura and Crow (1956)
    $$\phi(p, x, t) = \sum_{k=0}^{\infty} C_k \exp(-\lambda_k t + 2cx)\, V_{1k}^{(1)}(z) \tag{7}$$
  - where
    $$V_{1k}^{(1)}(z) = \sum_{n=0,1}{}^{\prime}\, f_n^{k}\, T_n^{1}(z)$$
- 15. Kolmogorov Backward Equation
  - Derived from a continuous time stochastic process (P)
  - Partial differential equation
    $$\frac{\partial \phi(p, x; t)}{\partial t} = \frac{1}{2} V_{\delta p}\frac{\partial^2}{\partial p^2}\phi(p, x; t) + M_{\delta p}\frac{\partial}{\partial p}\phi(p, x; t) \tag{8}$$
  - where
    - $p$: initial allele frequency (random variable)
    - $x$: allele frequency (a random variable in general, but held fixed at time $t$)
    - $t$: time (continuous variable)
    - $\phi(p, x; t)$: PDF
    - $V_{\delta p}$: variance of $\delta p$ (amount of change in allele frequency)
    - $M_{\delta p}$: mean of $\delta p$ (amount of change in allele frequency)
    - $V_{\delta p}$ and $M_{\delta p}$: both may depend on $p$ but not on $t$ (time homogeneous)
- 16. Steady State Distribution of Allele Frequencies
  - Equilibrium
    - single point (balance between various forces that keep allele frequencies near equilibrium)
    - PDF
  - ⇓ PDF of stable equilibrium instead of a single point
  - Steady state allele frequency distribution
    - Fisher (1922), (1930)
    - Wright (1931), (1937), (1938)
    $$\phi(p, x; t) = \text{solution of a Fokker-Planck equation} \tag{9}$$
    $$\lim_{t \to \infty} \phi(p, x; t) = \phi(x) \tag{10}$$
    $$\phi(x) = \frac{C}{V_{\delta x}} \exp\left(2\int \frac{M_{\delta x}}{V_{\delta x}}\,dx\right) \tag{11}$$
- 17. Steady State Distribution – Random Genetic Drift
  - For a large value of t, only the first few terms have an impact on the actual form of the PDF.
    $$\phi = 6p(1-p)\exp\left(-\frac{t}{2N_e}\right) + 30p(1-p)(1-2p)(1-2x)\exp\left(-\frac{3t}{2N_e}\right) + \cdots$$
  - Asymptotic formula:
    $$\lim_{t \to \infty} \phi = C \cdot \exp\left(-\frac{t}{2N_e}\right)$$
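How quickly the first term takes over can be seen by evaluating the two leading terms of the series directly. The sketch below assumes only those two terms as quoted on the slide; the function names are hypothetical. The ratio of the second term to the first works out to 5(1 − 2p)(1 − 2x)·exp(−t/Ne), so it decays exponentially in t.

```python
import math


def first_term(p, t, n_e):
    """Leading term of the drift series: 6 p (1 - p) exp(-t / (2 Ne))."""
    return 6.0 * p * (1.0 - p) * math.exp(-t / (2.0 * n_e))


def second_term(p, x, t, n_e):
    """Second term: 30 p (1 - p) (1 - 2p) (1 - 2x) exp(-3 t / (2 Ne))."""
    return (
        30.0 * p * (1.0 - p) * (1.0 - 2.0 * p) * (1.0 - 2.0 * x)
        * math.exp(-3.0 * t / (2.0 * n_e))
    )


def term_ratio(p, x, t, n_e):
    """Second term divided by first term."""
    return second_term(p, x, t, n_e) / first_term(p, t, n_e)


if __name__ == "__main__":
    n_e = 100
    for generations in (0, n_e, 2 * n_e, 5 * n_e):
        print(generations, term_ratio(p=0.1, x=0.2, t=generations, n_e=n_e))
```

For a starting frequency of p = 0.5 the second term vanishes identically because of the factor (1 − 2p), so the two-term comparison is only informative for asymmetric starting frequencies.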
- 18. Graphical Representation (Wright 1931)
  - Excerpt visible above the figure: "... is large can be found directly from the Poisson series according to which the chance of drawing 0, where $m$ is the mean number in a sample, is $e^{-m}$. The contribution to the 0 class will thus be $(e^{-1} + e^{-2} + e^{-3} + \cdots)f = \dfrac{e^{-1}}{1 - e^{-1}}f = 0.582f$."
  - [Figure: distribution against Factor Frequency, with axis marks at 25%, 50%, 75%] FIGURE 3. Distribution of gene frequencies in an isolated population in which fixation and loss of genes each is proceeding at the rate 1/4N in the absence of appreciable selection or muta-…
- 19. Steady State Distribution – Selection and Mutation
  - $M_{\delta x} = -ux + v(1-x) + \dfrac{x(1-x)}{2}\dfrac{d\bar{a}}{dx}$, $V_{\delta x} = \dfrac{x(1-x)}{2N_e}$
    $$\phi(x) = C \cdot \exp(2N_e\bar{a})\, x^{4N_e v - 1}(1-x)^{4N_e u - 1} \tag{12}$$
  - When A has selective advantage s over a:
    $$\bar{a} = 2s\,x^2 + s \cdot 2x(1-x) + 0 \cdot (1-x)^2 = 2sx$$
    $$\phi(x) = C \cdot \exp(4N_e s x)\, x^{4N_e v - 1}(1-x)^{4N_e u - 1} \tag{13}$$
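The step from the general steady-state formula, equation (11), to equation (12) is not shown on the slides. A worked sketch follows, assuming the moments stated on this slide and absorbing constant factors into $C$:

```latex
\begin{aligned}
\frac{M_{\delta x}}{V_{\delta x}}
  &= \frac{2N_e}{x(1-x)}\left[-ux + v(1-x) + \frac{x(1-x)}{2}\frac{d\bar{a}}{dx}\right]
   = 2N_e\left[\frac{v}{x} - \frac{u}{1-x} + \frac{1}{2}\frac{d\bar{a}}{dx}\right], \\[4pt]
2\int \frac{M_{\delta x}}{V_{\delta x}}\,dx
  &= 4N_e v \ln x + 4N_e u \ln(1-x) + 2N_e \bar{a}, \\[4pt]
\phi(x)
  &= \frac{C}{V_{\delta x}} \exp\!\left(2\int \frac{M_{\delta x}}{V_{\delta x}}\,dx\right)
   = \frac{2N_e C}{x(1-x)}\, e^{2N_e \bar{a}}\, x^{4N_e v}(1-x)^{4N_e u} \\
  &= C'\, e^{2N_e \bar{a}}\, x^{4N_e v - 1}(1-x)^{4N_e u - 1},
  \qquad C' = 2N_e C .
\end{aligned}
```

Substituting $\bar{a} = 2sx$ gives the factor $\exp(4N_e s x)$ in equation (13).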
- 20. Graphical Representation (Wright 1937)
  - [Figure panels reproduced from S. Wright, Proc. N. A. S., p. 308: Figs. 1, 2, 4, 5, and 6]
- 21. Time Series Analysis
  - When a variable is measured sequentially in time, the resulting data form a time series.
  - Diffusion Model: continuous time stochastic process
  - Time Series: discrete time stochastic process
- 22. Basic Models
  - Observations close together in time tend to be correlated
  - Autoregressive Model: AR(p)
    $$X_t = c + \sum_{i=1}^{p} \psi_i X_{t-i} + \varepsilon_t \tag{14}$$
  - Moving Average Model: MA(q)
    $$X_t = c + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i} + \varepsilon_t \tag{15}$$
  - Autoregressive Moving Average Model: ARMA(p, q)
    $$X_t = \text{AR}(p) + \text{MA}(q) \tag{16}$$
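The claim that nearby observations are correlated can be made concrete with an AR(1) process, the simplest case of equation (14). The sketch below is illustrative only: the function names, Gaussian errors, burn-in length, and the coefficient value 0.635 (borrowed from the fitted model later in the deck) are assumptions. For a stationary AR(1), the lag-1 autocorrelation equals the coefficient ψ₁, so the estimate should sit near it.

```python
import random


def simulate_ar1(psi, n, c=0.0, sigma=1.0, seed=0, burn_in=200):
    """Simulate X_t = c + psi * X_{t-1} + e_t with Gaussian errors e_t ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    x = 0.0
    series = []
    for i in range(n + burn_in):
        x = c + psi * x + rng.gauss(0.0, sigma)
        if i >= burn_in:  # discard the start-up transient
            series.append(x)
    return series


def lag1_autocorrelation(series):
    """Sample autocorrelation at lag 1."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[i] - mean) * (series[i - 1] - mean) for i in range(1, n))
    den = sum((v - mean) ** 2 for v in series)
    return num / den


if __name__ == "__main__":
    xs = simulate_ar1(psi=0.635, n=5000)
    print("lag-1 autocorrelation:", lag1_autocorrelation(xs))
```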
- 23. Time Series as a Polynomial Equation
  - $B^k X_t = X_{t-k}$ (back shift operator)
  - AR(p)
    $$X_t = \psi_1 X_{t-1} + \cdots + \psi_p X_{t-p}$$
    $$X_t = (\psi_1 B + \cdots + \psi_p B^p) X_t$$
    $$(1 - \psi_1 B - \cdots - \psi_p B^p) X_t = 0$$
  - ARMA(p, q)
    $$X_t - \psi_1 X_{t-1} - \cdots - \psi_p X_{t-p} = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}$$
    $$(1 - \psi_1 B - \cdots - \psi_p B^p) X_t = (1 + \theta_1 B + \cdots + \theta_q B^q)\,\varepsilon_t$$
- 24. Stationary Process
  - The mean and variance do not change over time. No trend.
  - [Figure 6: Random Walk, plotted against Time up to 10000 (not stationary). Figure 7: Detrended, plotted against Time up to 10000 (looks stationary).]
  - Detrending:
    - linear regression
    - take a difference
    - Autoregressive Integrated Moving Average: ARIMA(p,d,q)
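Detrending by differencing can be sketched in a few lines. The helper names below are hypothetical and the Gaussian random walk is an assumption for illustration. Differencing a random walk once recovers its increments, which have constant mean and variance, and this is the "I" (integrated, order d) part of ARIMA(p,d,q).

```python
import random


def difference(series, d=1):
    """Apply first-order differencing d times: y_t = x_t - x_{t-1}."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def random_walk(n_steps, seed=0):
    """Return (walk, steps) where walk is the cumulative sum of Gaussian steps, starting at 0."""
    rng = random.Random(seed)
    steps = [rng.gauss(0.0, 1.0) for _ in range(n_steps)]
    walk = [0.0]
    for s in steps:
        walk.append(walk[-1] + s)
    return walk, steps


if __name__ == "__main__":
    walk, steps = random_walk(1000)
    diffs = difference(walk)
    # The differenced series matches the underlying increments up to rounding.
    print(max(abs(a - b) for a, b in zip(diffs, steps)))
```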
- 25. Application on Allele Frequencies
  - Influential SNPs: indicative of deterministic trends
  - Uninfluential SNPs: random fluctuation?
  - Diffusion Model: assumes a Markovian process
  - Time Series: which model describes the process of change of allele frequencies?
  - Application
    - Objective: model the process of change of allele frequencies
    - Data: SNP genotypes of 4,798 Holstein bulls with 38,416 markers, and milk yield
    - Genotype imputation: FastPhase 1.4
    - Estimation of marker effects: BayesCπ
- 26. BayesCπ
  - [Figure 8: GAW17. Title page of the paper "Analysis of human mini-exome sequencing data using a Bayesian hierarchical mixture model: Genetic Analysis Workshop 17"]
  - Authors: Bueno Filho JS (1,2,∗), Morota G (1,∗), Tran QT (3), Maenner MJ (4), Vera-Cala LM (4,5), Engelman CD (4,§), and Meyers KJ (4,§)
  - Affiliations: (1) Department of Dairy Science, University of Wisconsin-Madison, USA; (2) Departamento de Ciências Exatas, Universidade Federal de Lavras, Brasil; (3) Department of Statistics, University of Wisconsin-Madison, USA; (4) Department of Population Health Sciences, University of Wisconsin-Madison, USA; (5) Departamento de Salud Publica, Universidad Industrial de Santander, Colombia
  - ∗ Contributed equally to this work; § Corresponding author
  - Email addresses: JSB: jssbueno@dex.ufla.br; GM: morota@wisc.edu; QTT: tran@stat.wisc.edu; MJM: maenner@waisman.wisc.edu; LMV: veracala@wisc.edu; CDE: cengelman@wisc.edu; KJM: kjmeyers2@wisc.edu
- 27. Allele Frequency of the Top Marker
  - [Figure 9: Time plots of allele frequencies (Allele Frequency against Time, 0 to 30). Top: original series. Bottom: detrended series, obtained by taking the first-order difference.]
- 28. Autocorrelation and Partial Autocorrelation
  - ARIMA(1,1,1)?
  - [Figure 10: ACF and PACF against Lag (0 to 14). Top row: original series. Bottom row: first-order difference series.]
- 29. Model Selection
  - Table 1: Comparison of several competitive models

    | Model | AIC |
    | --- | --- |
    | ARIMA(1,0,0) | −51.56 |
    | ARIMA(0,1,0) | −49.38 |
    | ARIMA(0,0,1) | −46.41 |
    | ARIMA(1,1,0) | −52.47 |
    | ARIMA(1,0,1) | −51.13 |
    | ARIMA(1,1,1) | −51.02 |

  - ARIMA(1,1,0): $X_t = 0.635\,X_{t-1} + \varepsilon_t$
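The comparison in Table 1 amounts to choosing the model with the smallest AIC. The snippet below is a minimal sketch: the AIC values are copied from the table, and the variable names are hypothetical. It ranks the candidates and reports each model's AIC difference from the best one.

```python
# AIC values copied from Table 1 on the slide (lower is better).
aic = {
    "ARIMA(1,0,0)": -51.56,
    "ARIMA(0,1,0)": -49.38,
    "ARIMA(0,0,1)": -46.41,
    "ARIMA(1,1,0)": -52.47,
    "ARIMA(1,0,1)": -51.13,
    "ARIMA(1,1,1)": -51.02,
}

# The preferred model minimises AIC.
best = min(aic, key=aic.get)

# Difference from the best model; small differences mean weak preference.
delta = {model: value - aic[best] for model, value in aic.items()}

if __name__ == "__main__":
    for model in sorted(delta, key=delta.get):
        print(f"{model}: AIC = {aic[model]:.2f}, delta = {delta[model]:.2f}")
```

Several candidates are within about 1.5 AIC units of the minimum, so the ranking separates ARIMA(1,1,0) from its nearest competitors only modestly.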
- 30. Advanced Models
  - Time dependent variance
    - ARCH (Autoregressive Conditional Heteroskedasticity)
    - GARCH (Generalized Autoregressive Conditional Heteroskedasticity)
  - Multivariate
    - VARMA (Vector Autoregression Moving Average)
    - BVARMA (Bayesian Vector Autoregression Moving Average)
- 31. Intersection of Mathematics and Statistics
  - Under certain conditions, GARCH(1,1) ≈ Diffusion Model!
- 32. Thank you!