Daitong_Li_Presentation

Estimating Speciation Parameters
Using DNA Sequence Data
Name: Daitong Li
Supervisor: Dr H Herbots
Module: STAT3901

Darwinism: Questions Unsolved
New species arise and develop through natural selection, which can occur
only if individuals vary.
But exactly HOW new species arise remains a pressing question...

Outline

Coalescent Approach

Genetic Variation Data

Mutation Model

Speciation Model

Maximum Likelihood Estimation

Fruit Flies Data

Approximation

Conclusion

Coalescent Approach: Assumptions
1. Constant population size, non-overlapping generations
2. No selection
3. No recombination
→ exponentially distributed coalescent time

Outline

Coalescent Approach

Genetic Variation Data

Mutation Model

Speciation Model


Fruit Flies Data

Approximation

Conclusion

Locus: the position of a gene on a chromosome.
Sequence: the exact order of the bases A, T, C and G in a piece of DNA.
[Figure: two aligned sequences carrying A and T at the same site]
1 pairwise difference found!
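
Counting pairwise differences can be sketched in a few lines of Python. This is only an illustration: the function name and the two 10-site sequences are hypothetical and are not taken from the fruit fly data.

```python
def pairwise_differences(seq_a: str, seq_b: str) -> int:
    """Count the sites at which two aligned sequences carry different bases."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(base_a != base_b for base_a, base_b in zip(seq_a, seq_b))


# Hypothetical sequences that differ only at the fourth site (A vs T).
print(pairwise_differences("GCTAGGCATC", "GCTTGGCATC"))  # → 1
```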

Outline

Coalescent Approach


Mutation Model

Speciation Model


Fruit Flies Data

Approximation

Conclusion

Mutation Model: Independence
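θ: mutation rate per pair of lineages
• Mutations ~ Poisson process of rate θ per pair of lineages
• E[number of mutations since MRCA] = θ E[coalescent time]
[Figure: genealogy of two lineages, labelled with the coalescent time and the MRCA (most recent common ancestor)]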
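
The relation E[number of mutations] = θ E[coalescent time] can be checked by simulation. The sketch below assumes a single population in which the coalescent time T is exponential with rate 1 (so E[T] = 1). The slides do not state the time scaling, so this rate is an assumption, as are the function name and the value of θ.

```python
import random


def simulate_mutations(theta: float, rng: random.Random) -> int:
    """Draw the number of mutations separating two lineages since their MRCA."""
    # Assumption: coalescent time T ~ Exp(1), i.e. E[T] = 1.
    t = rng.expovariate(1.0)
    # Poisson process of rate theta on [0, t]: count exponential waiting times.
    count = 0
    elapsed = rng.expovariate(theta)
    while elapsed <= t:
        count += 1
        elapsed += rng.expovariate(theta)
    return count


rng = random.Random(1)
theta = 2.0  # hypothetical value
draws = [simulate_mutations(theta, rng) for _ in range(50_000)]
mean = sum(draws) / len(draws)
print(f"simulated mean = {mean:.3f}, theta * E[T] = {theta * 1.0:.3f}")
```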

Mutation Model: Infinite sites model
• Infinite number of sites on each sequence
• Only a small number of such sequences
• Each mutation is assumed to occur at a different site
• Number of mutations = Number of differences
It has since been recognised that sequences are finite, and we now have
access to a large number of such sequences.

Mutation Model: Finite sites model
Number of mutations ≥ Number of differences

Mutation Model: JC model
• Mutation can happen at the same site more than once
• Each base can mutate to any of the four possible bases (A, G, C, T) with equal probability 0.25
• Given the coalescent time, the number of pairwise differences has a Binomial distribution
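
A minimal sketch of the Binomial step, under stated assumptions. The slides do not give the per-site rate, so the code uses the standard Jukes-Cantor form P(different | T) = 3/4 (1 - exp(-4θT / (3n))) for a locus of n sites, which treats θ as the rate of base-changing mutations per locus for the pair of lineages. The function names and parameter values are hypothetical.

```python
import math
import random


def jc_difference_probability(theta: float, n_sites: int, t: float) -> float:
    """P(two genes differ at one site | coalescent time t), Jukes-Cantor form.

    Assumption: theta is the base-changing mutation rate per locus for the
    pair of lineages, spread evenly over n_sites sites.
    """
    return 0.75 * (1.0 - math.exp(-4.0 * theta * t / (3.0 * n_sites)))


def simulate_differences(theta: float, n_sites: int, t: float,
                         rng: random.Random) -> int:
    """Draw the number of pairwise differences ~ Binomial(n_sites, p(t))."""
    p = jc_difference_probability(theta, n_sites, t)
    return sum(rng.random() < p for _ in range(n_sites))


rng = random.Random(0)
print(jc_difference_probability(theta=2.0, n_sites=10, t=1.0))
print(simulate_differences(theta=2.0, n_sites=10, t=1.0, rng=rng))
```

Unlike the infinite sites model, the number of differences can never exceed the number of sites, and the per-site probability saturates at 3/4 for very old common ancestors.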

Outline

Coalescent Approach


Mutation Model

Speciation Model


Fruit Flies Data

Approximation

Conclusion

Speciation Model: Isolation Model
[Figure: isolation model, labelled with the isolation time, the sizes of descendant populations 1 and 2, and the size of the ancestral population]

Speciation Model: IIM (isolation with initial migration) Model
[Figure: IIM model, labelled with the isolation time, the migration rate and the migration starting point]

Outline

Coalescent Approach


Mutation Model

Speciation Model

Maximum Likelihood Estimation

Fruit Flies Data

Approximation

Conclusion

IIM Model (7)
• Mutation rate θ
• Migration rate M
• Ancestral population size a
• Descendant population 1 size c1
• Migration starting point τ0
• Isolation time τ1
Isolation Model (4)
• Mutation rate θ
• Ancestral population size a
• Isolation time τ

• Simulate pairwise difference data based on the Binomial distribution.
• For both the IIM and the isolation model, simulate equally large numbers of
pairs of genes sampled from the same descendant population (1 or 2) and from
different descendant populations.
• Maximise the joint log-likelihood.
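
The idea of maximising a log-likelihood over pairwise difference counts can be sketched on a much simpler model than the IIM or isolation model. The toy below is not the method used in the project. It assumes a single population with coalescent time T ~ Exp(1), a Jukes-Cantor difference probability 3/4 (1 - exp(-4θT / (3n))), a single class of loci, and a plain grid search over θ. All function names are hypothetical. The "data" are expected counts under a known θ, so the search should land close to that value.

```python
import math


def pmf_differences(theta: float, n_sites: int, grid: int = 400) -> list:
    """P(S = k) for k = 0..n_sites, integrating the Binomial pmf over T ~ Exp(1).

    Uses the substitution u = exp(-T), which is Uniform(0, 1), and a
    midpoint rule with `grid` points.
    """
    pmf = [0.0] * (n_sites + 1)
    exponent = 4.0 * theta / (3.0 * n_sites)
    for i in range(grid):
        u = (i + 0.5) / grid
        p = 0.75 * (1.0 - u ** exponent)  # per-site difference probability
        for k in range(n_sites + 1):
            pmf[k] += math.comb(n_sites, k) * p**k * (1.0 - p) ** (n_sites - k) / grid
    return pmf


def log_likelihood(theta: float, counts: list, n_sites: int) -> float:
    """Log-likelihood of counts[k] = number of loci with k differences."""
    pmf = pmf_differences(theta, n_sites)
    return sum(c * math.log(q) for c, q in zip(counts, pmf) if c > 0)


def grid_mle(counts: list, n_sites: int, thetas: list) -> float:
    """Return the candidate theta with the largest log-likelihood."""
    return max(thetas, key=lambda th: log_likelihood(th, counts, n_sites))


n_sites = 10
true_theta = 2.0  # hypothetical value used to build the toy data
counts = [round(10_000 * q) for q in pmf_differences(true_theta, n_sites)]
thetas = [0.1 * i for i in range(1, 51)]
theta_hat = grid_mle(counts, n_sites, thetas)
print(f"grid MLE of theta: {theta_hat:.1f}")
```

In the full problem the log-likelihood is a sum over the classes of loci (same population, different populations) and is maximised jointly over all parameters, typically with a numerical optimiser rather than a grid.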

Outline

Coalescent Approach


Mutation Model

Speciation Model


Fruit Flies Data

Approximation

Conclusion

Fruit Flies Data

We studied a fruit fly data set originally analysed by Wang and Hey (2010),
fitting both the IIM and the isolation model.

We trim the original sequences to 10 sites each.

We manipulate the original data so that we have about 10,000 loci
where two genes are sampled from the same population 1 (D.sim),
and 20,000 loci where two genes are sampled from different
populations 1 and 2 (D.sim and D.mel).

We do not have data where two genes are from the same
population 2, so we cannot estimate c2.

Fruit Flies Data: IIM model
                                  θ      τ1     τ0     c1     a      M
Infinite sites (full sequence)    2.68   2.69   6.20   2.50   1.20   0.089
JC (last 10 sites)                1.15   1.13   4.08   1.21   1.98   0.098

Fruit Flies Data: Isolation model
                                  θ      τ      a
Infinite sites (full sequence)    2.68   5.10   1.75
JC (last 10 sites)                1.35   2.75   1.99

Fruit Flies Data: Model Comparison
Maximum log-likelihood
                          first 10 sites   second 10 sites   last 10 sites
IIM                       -32731.8         -49147.34         -32707.25
Isolation                 -32741.1         -49201.92         -32719.78
2 ln(likelihood ratio)    18.6             109.16            25.06
• At the 1% significance level, the critical value of χ² with 3 degrees of freedom is 11.34, so we have sufficient evidence
that the IIM model fits the data significantly better than the isolation model.
• The AIC results for all 3 sets of 10 sites agree with the conclusion above.
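
The likelihood ratio statistics can be recomputed from the maximum log-likelihoods in the table. In the sketch below the log-likelihoods and the critical value 11.34 come from the slide. The AIC comparison assumes the IIM model has 3 more free parameters than the isolation model, matching the 3 degrees of freedom of the test. The variable and function names are hypothetical.

```python
# Maximum log-likelihoods from the model comparison table.
LOG_LIKELIHOODS = {
    "first 10 sites": {"IIM": -32731.8, "Isolation": -32741.1},
    "second 10 sites": {"IIM": -49147.34, "Isolation": -49201.92},
    "last 10 sites": {"IIM": -32707.25, "Isolation": -32719.78},
}
CRITICAL_VALUE = 11.34  # chi-square, 3 degrees of freedom, 1% level (from the slide)
EXTRA_PARAMETERS = 3    # assumption: IIM has 3 more free parameters than isolation


def lr_statistic(ll_full: float, ll_reduced: float) -> float:
    """2 ln(likelihood ratio) for nested models."""
    return 2.0 * (ll_full - ll_reduced)


def aic_difference(ll_full: float, ll_reduced: float, extra: int) -> float:
    """AIC(reduced) - AIC(full); positive values favour the full model."""
    return lr_statistic(ll_full, ll_reduced) - 2.0 * extra


results = {}
for label, ll in LOG_LIKELIHOODS.items():
    stat = lr_statistic(ll["IIM"], ll["Isolation"])
    delta_aic = aic_difference(ll["IIM"], ll["Isolation"], EXTRA_PARAMETERS)
    results[label] = (stat, delta_aic)
    print(f"{label}: 2 ln LR = {stat:.2f}, "
          f"reject isolation at 1%: {stat > CRITICAL_VALUE}, "
          f"AIC difference = {delta_aic:.2f}")
```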

Fruit Flies Data
What is the main reason that our results differ from the
results obtained using the infinite sites model?
• Difference in modelling
• Loss of information from trimming the data to 10 sites
Will the results differ if we use 20 sites, 30 sites, 100 sites
and so on?

Outline

Coalescent Approach


Mutation Model

Speciation Model


Fruit Flies Data

Approximation

Conclusion

Approximation
red: 10 sites
green: 20 sites
blue: 80 sites
black: infinite sites
When the length of the locus gets larger than 80 sites, the infinite
sites model acts as a good approximation to the Jukes and Cantor model.
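
How the finite-sites model approaches the infinite sites model as the locus gets longer can be illustrated with expected values in a toy setting. The sketch assumes a single population with T ~ Exp(1) and the Jukes-Cantor difference probability 3/4 (1 - exp(-4θT / (3n))). It does not reproduce the figure, which is based on the speciation models. The function names and the value of θ are hypothetical.

```python
def mean_differences_jc(theta: float, n_sites: int) -> float:
    """E[number of differences] under Jukes-Cantor with T ~ Exp(1).

    Uses E[1 - exp(-r T)] = r / (1 + r) for T ~ Exp(1), with
    r = 4 * theta / (3 * n_sites) the per-site rate in the JC formula.
    """
    rate = 4.0 * theta / (3.0 * n_sites)
    return n_sites * 0.75 * rate / (1.0 + rate)


def mean_differences_infinite_sites(theta: float) -> float:
    """E[number of differences] = theta * E[T], with E[T] = 1."""
    return theta


theta = 6.0  # hypothetical value
for n_sites in (10, 20, 80, 1000):
    print(f"{n_sites:>5} sites: {mean_differences_jc(theta, n_sites):.3f}")
print(f"infinite sites: {mean_differences_infinite_sites(theta):.3f}")
```

In this toy setting the expected number of differences increases with the number of sites and stays below the infinite sites value; how quickly the gap closes depends on θ and on the speciation model.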

Conclusion
• Theoretical results for the number of pairwise nucleotide differences under the IIM and
isolation models with the Jukes and Cantor mutation model
• MLEs for the fruit fly data seem different from the results obtained using the
infinite sites model, but this might not be due to the difference in modelling
• Approximation results indicate that when the length of the locus reaches
80 sites, the infinite sites model can be a good approximation of the JC model.

If only data with shorter loci (shorter than 80 sites) are available, we need to
use the exact model (JC model).

Questions: Suggestions for future work
• Analysing longer sequences (especially between 10 and 80
sites) with more mathematically explicit packages, for
example Maple or Mathematica.

Questions: Fruit Flies Data
[Figure: genealogy from the MRCA to D.sim1&2, D.mel and D.yakuba]
Triplets of sequences:
D.sim1 30,000
D.sim2 30,000
D.mel 30,000

Questions: Rounding Error
n = 30, k = 16, j = 1:
term for b = 2: 1.0542e-03
term for b = 3: -4.9174e-03
sum of the two terms < 0 !

Questions: Rounding Error
[1] 1.301120e-01 2.012619e-01 1.301120e-01 3.478349e-02 3.032439e-01
[6] 2.012619e-01 5.954263e-02 3.465417e-02 1.965132e-02 3.032439e-01
[11] 4.441905e-02 2.012619e-01 2.012619e-01 2.012619e-01 2.012619e-01
[16] 3.032439e-01 8.547706e-02 2.012619e-01 2.012619e-01 3.032439e-01
[21] 3.032439e-01 8.547706e-02 3.032439e-01 3.032439e-01 3.032439e-01
[26] 2.012619e-01 -1.460063e+04 2.265334e-02 1.301120e-01 4.441905e-02
[31] 3.032439e-01 3.032439e-01 3.032439e-01 1.301120e-01 1.301120e-01
[36] 3.032439e-01 2.012619e-01 3.465417e-02 3.032439e-01 1.301120e-01
[41] 3.032439e-01 1.301120e-01 5.954263e-02 3.032439e-01 3.032439e-01
[46] 2.788816e-02 3.032439e-01 1.301120e-01 2.012619e-01 3.032439e-01
2 1 2 6 0 1 4 10 9 0 5 1 1 1 1 0 3 1 1 0 0 3 0 0 0 1
16 8 2 5 0 0 0 2 2 0 1 10 0 2 0 2 4 0 0 7 0 2 1 0
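
The slides do not show the formula that produced these values. One plausible mechanism, offered only as an illustration, is an alternating sum whose large terms cancel. The sketch below expands (1 - x)^k as an alternating binomial sum and evaluates it once in floating point and once with exact rational arithmetic. The function name and the values of x and k are hypothetical.

```python
from fractions import Fraction
from math import comb


def power_by_alternating_sum(x, k: int):
    """Compute (1 - x)**k by expanding it as sum_b C(k, b) * (-x)**b."""
    return sum(comb(k, b) * (-x) ** b for b in range(k + 1))


k = 60
float_result = power_by_alternating_sum(0.9, k)               # floating point
exact_result = power_by_alternating_sum(Fraction(9, 10), k)   # exact rationals

print("floating point:", float_result)
print("exact:         ", float(exact_result))
```

The exact value is 0.1^60. The floating-point sum adds and subtracts terms many orders of magnitude larger than that, so its result is dominated by rounding error and can even come out negative, which is not possible for a probability.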

Questions: Converted Parameters (IIM)
                                  θ      τ1          τ0          c1      a       M
Infinite sites (full sequence)    2.68   1.58 myrs   3.62 myrs   7.30m   3.51m   7.59e-09/yr
JC (last 10 sites)                1.15   0.66 myrs   2.38 myrs   3.53m   5.78m   8.29e-09/yr

Questions: Converted Parameters (Isol.)
                                  θ      τ           a
Infinite sites (full sequence)    2.68   2.97 myrs   5.16m
JC (last 10 sites)                1.35   1.61 myrs   5.81m

Approximation
red: 10 sites
green: 20 sites
blue: 80 sites
black: infinite sites
                  Min.    1st Qu.   Median   Mean    3rd Qu.   Max.
Infinite sites    0.000   4.000     6.000    6.354   8.000     32.000
10 sites          0.000   3.000     4.000    4.119   5.000     10.000
20 sites          0.00    3.00      5.00     5.06    7.00      16.00
80 sites          0.000   4.000     6.000    6.051   8.000     28.000
Questions: JC model
P(two genes having the same base at that site given T)
= P(no mutations at that site since MRCA)
+ P(at least one mutation at that site and most recent mutation was
to the same base as the other lineage has at that site)
Recall that mutations follow a Poisson process with rate θ,
and so
P(two genes having different bases at that site given T )
= 1 - P(two genes having the same base at that site given T)
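
The decomposition above can be written out explicitly. In the sketch below, λ is an assumed symbol for the rate at which mutation events hit the given site on the two lineages combined; the slides do not state how λ relates to θ and the number of sites. The factor 1/4 is the probability that the most recent mutation produced the base carried by the other lineage, since each mutation lands on any of the four bases with probability 0.25.

```latex
\begin{align*}
P(\text{same base} \mid T)
  &= \underbrace{e^{-\lambda T}}_{\text{no mutation at the site}}
   + \underbrace{\bigl(1 - e^{-\lambda T}\bigr)\cdot\tfrac{1}{4}}_{\text{at least one mutation, last one matches}} \\
P(\text{different bases} \mid T)
  &= 1 - P(\text{same base} \mid T)
   = \tfrac{3}{4}\bigl(1 - e^{-\lambda T}\bigr)
\end{align*}
```

With n independent sites, the number of pairwise differences given T is then Binomial with n trials and this success probability.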
