This document discusses estimating speciation parameters from DNA sequence data. It outlines how a coalescent approach and genetic variation data are used to build mutation and speciation models. Maximum likelihood estimation is used to fit the isolation-with-initial-migration (IIM) and isolation models to fruit fly data. Approximation results show that the infinite sites model is a good approximation to the Jukes-Cantor model when the locus length exceeds 80 sites. Questions are raised about analysing longer sequences and addressing rounding errors.
2. Darwinism: Questions Unsolved
But HOW exactly new species
arise remains a pressing question...
New species arise and develop through
natural selection, which can occur
only if individuals vary.
6. Genetic Variation Data
Locus: the position of a gene on a
chromosome.
Sequence: exact order of the bases A, T, C
and G in a piece of DNA
[Figure: two aligned DNA sequences compared site by site]
1 pairwise difference found!
8. Mutation Model: Independence
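θ: mutation rate per pair of lineages
• Mutations ~ Poisson process of
rate θ per pair of lineages
• E[number of mutations since
MRCA] = θ E[coalescent time]
[Figure: coalescent tree of two lineages, with the coalescent time measured back to the MRCA (most recent common ancestor)]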
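The relation E[number of mutations] = θ E[coalescent time] can be checked with a small simulation. This is a minimal sketch, not the author's code: it assumes a single population in which the coalescent time T of a pair is exponential with mean 1, and the names `sample_mutation_count` and `check_mean` are hypothetical.

```python
import random


def sample_mutation_count(theta, rng):
    """Draw one coalescent time T and the number of mutations accumulated since the MRCA."""
    # ASSUMPTION: single population, so T ~ Exp(1) in coalescent time units.
    t = rng.expovariate(1.0)
    # Count the events of a Poisson process of rate theta that fall in [0, t].
    count = 0
    elapsed = rng.expovariate(theta)
    while elapsed <= t:
        count += 1
        elapsed += rng.expovariate(theta)
    return count, t


def check_mean(theta, n_pairs, seed=0):
    """Return (mean number of mutations, theta * mean coalescent time) over n_pairs draws."""
    rng = random.Random(seed)
    total_mutations = 0
    total_time = 0.0
    for _ in range(n_pairs):
        count, t = sample_mutation_count(theta, rng)
        total_mutations += count
        total_time += t
    return total_mutations / n_pairs, theta * total_time / n_pairs
```

Calling `check_mean(2.0, 50000)` should return two values that agree closely, which is the identity on the slide.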
9. Mutation Model: Infinite sites model
• Infinite number of sites on each sequence
• Only a small number of such sequences
• Each mutation is assumed to happen at a different site
• Number of mutations = number of differences
But it has now been discovered that the sequences are finite, and we
have access to a large number of such sequences.
11. Mutation Model: JC model
• Mutation can happen at the same site more than once
• Each base can mutate to any of the four possible bases
(A, G, C, T) with equal probability 0.25
• The number of pairwise differences given the coalescent time has
a Binomial distribution
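The Binomial distribution mentioned above can be written down directly. This is a minimal sketch under stated assumptions: sites mutate independently, and `site_rate` is a hypothetical per-site mutation rate for the pair of lineages (the slides do not say how it is scaled relative to θ). The per-site probability of a difference follows the argument on the "Questions: JC model" slide.

```python
import math


def jc_diff_prob(site_rate, t):
    """P(two genes differ at one site | coalescent time t) under the JC model.

    With at least one mutation, the site is a uniform draw from 4 bases,
    so the two genes differ with probability 3/4.
    """
    return 0.75 * (1.0 - math.exp(-site_rate * t))


def pairwise_diff_pmf(k, n_sites, site_rate, t):
    """P(k pairwise differences | t) for a locus of n_sites independent sites."""
    p = jc_diff_prob(site_rate, t)
    return math.comb(n_sites, k) * p**k * (1.0 - p) ** (n_sites - k)
```

For a 10-site locus, `[pairwise_diff_pmf(k, 10, site_rate, t) for k in range(11)]` gives the full conditional distribution, which sums to 1.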
16. Outline
Coalescent Approach
Genetic Variation Data
Mutation Model
Speciation Model
Maximum Likelihood Estimation
Fruit Flies Data
Approximation
Conclusion
17. Maximum Likelihood Estimation
IIM model (7 parameters)
Mutation Rate θ
Migration rate M
Ancestral population a
Descendant population 1 c1
Descendant population 2 c2
Migration starting point τ0
Isolation time τ1
Isolation model (4 parameters)
Mutation Rate θ
Ancestral population a
Descendant population 2 c2
Isolation time τ
18. Maximum Likelihood Estimation
• Simulate pairwise difference data from the Binomial distribution.
• For both the IIM and isolation models, simulate an equally large number of
pairs of genes drawn from within descendant population 1, within descendant
population 2, and from different descendant populations.
• Maximise the joint log-likelihood.
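The simulate-then-maximise procedure can be sketched in a much simplified setting. This is an illustration only, not the IIM or isolation likelihood: it assumes a single population with T ~ Exp(1), one kind of gene pair, one unknown parameter (a hypothetical per-site rate `site_rate`), a midpoint-rule integral over T, and a grid search in place of a proper optimiser. All function names are hypothetical.

```python
import math
import random


def jc_diff_prob(site_rate, t):
    """P(difference at one site | coalescent time t) under the JC model."""
    return 0.75 * (1.0 - math.exp(-site_rate * t))


def simulate_differences(n_loci, n_sites, site_rate, rng):
    """Simulate the number of pairwise differences at each of n_loci loci."""
    data = []
    for _ in range(n_loci):
        t = rng.expovariate(1.0)  # ASSUMPTION: single population, T ~ Exp(1)
        p = jc_diff_prob(site_rate, t)
        # Binomial(n_sites, p) draw as a sum of independent Bernoulli trials.
        data.append(sum(rng.random() < p for _ in range(n_sites)))
    return data


def pmf_table(n_sites, site_rate, t_max=30.0, steps=3000):
    """P(K = k) for k = 0..n_sites, integrating the Binomial over T ~ Exp(1)."""
    h = t_max / steps
    combs = [math.comb(n_sites, k) for k in range(n_sites + 1)]
    table = [0.0] * (n_sites + 1)
    for i in range(steps):
        t = (i + 0.5) * h  # midpoint rule
        weight = math.exp(-t) * h
        p = jc_diff_prob(site_rate, t)
        for k in range(n_sites + 1):
            table[k] += weight * combs[k] * p**k * (1.0 - p) ** (n_sites - k)
    return table


def log_likelihood(data, n_sites, site_rate):
    """Joint log-likelihood of independent loci."""
    table = pmf_table(n_sites, site_rate)
    return sum(math.log(table[k]) for k in data)


def grid_mle(data, n_sites, grid):
    """Return the grid value of site_rate with the highest log-likelihood."""
    return max(grid, key=lambda rate: log_likelihood(data, n_sites, rate))
```

In the full problem the joint log-likelihood would add the contributions of the three kinds of gene pairs, with T distributed according to the population structure of the model being fitted.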
20. Fruit Flies Data
We studied a fruit fly data set, originally analysed by
Wang and Hey (2010), by fitting both the IIM and isolation models.
We trim the original sequences to 10 sites.
We manipulate the original data so that we have about 10,000 loci
where two genes are sampled from the same population 1 (D.sim),
and 20,000 loci where two genes are sampled from different
populations 1 and 2 (D.sim and D.mel).
We do not have data where two genes are from the same
population 2, so we cannot estimate c2.
21. Fruit Flies Data: IIM model
Model                            θ      τ1     τ0     c1     a      M
Infinite sites (full sequence)   2.68   2.69   6.20   2.50   1.20   0.089
JC (last 10 sites)               1.15   1.13   4.08   1.21   1.98   0.098
22. Fruit Flies Data: Isolation model
Model                            θ      τ      a
Infinite sites (full sequence)   2.68   5.10   1.75
JC (last 10 sites)               1.35   2.75   1.99
23. Fruit Flies Data: Model Comparison
Maximum log-likelihood    first 10 sites   second 10 sites   last 10 sites
IIM                       -32731.8         -49147.34         -32707.25
Isolation                 -32741.1         -49201.92         -32719.78
2 ln(likelihood ratio)    18.6             109.16            25.06
• At the 1% significance level, the critical value of the χ² distribution with 3 degrees of freedom is 11.34, so we have sufficient evidence to believe that
the IIM model fits the data significantly better than the isolation model.
• The AIC results for all 3 sets of 10 sites agree with the conclusion above.
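The model comparison can be reproduced from the maximum log-likelihoods reported on this slide. A minimal sketch: the log-likelihoods and the critical value 11.34 are taken from the slide; the parameter counts 6 (IIM) and 3 (isolation) are an assumption read off the columns of the two estimate tables, consistent with the 3 degrees of freedom used here. The function names are hypothetical.

```python
# Maximum log-likelihoods from the slide: (IIM, isolation) for each set of sites.
LOGLIK = {
    "first 10 sites": (-32731.8, -32741.1),
    "second 10 sites": (-49147.34, -49201.92),
    "last 10 sites": (-32707.25, -32719.78),
}
CRITICAL_1PCT_DF3 = 11.34  # chi-square critical value quoted on the slide
K_IIM, K_ISOLATION = 6, 3  # ASSUMPTION: number of parameters actually estimated


def lr_statistic(loglik_full, loglik_nested):
    """Likelihood ratio statistic 2 ln(LR) for nested models."""
    return 2.0 * (loglik_full - loglik_nested)


def aic(loglik, n_params):
    """Akaike information criterion; lower is better."""
    return 2.0 * n_params - 2.0 * loglik


def compare_models():
    """Return, per data set, (2 ln LR, reject isolation at 1%?, IIM has lower AIC?)."""
    results = {}
    for name, (ll_iim, ll_iso) in LOGLIK.items():
        stat = lr_statistic(ll_iim, ll_iso)
        results[name] = (
            stat,
            stat > CRITICAL_1PCT_DF3,
            aic(ll_iim, K_IIM) < aic(ll_iso, K_ISOLATION),
        )
    return results
```

For all three sets of sites the statistic exceeds 11.34 and the IIM model has the lower AIC, matching the two bullet points above.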
24. Fruit Flies Data
What is the main reason that our results differ from the
results obtained using the infinite sites model?
• Difference in modelling
• Loss of information from trimming the data to 10 sites
Will the results differ if we use 20 sites, 30 sites, 100 sites,
and so on?
26. Approximation
red: 10 sites
green: 20 sites
blue: 80 sites
black: infinite sites
When the locus length exceeds 80 sites, the infinite sites model acts as a good approximation to the Jukes and Cantor model.
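One crude way to see why the two models converge as the locus gets longer is to compare the expected number of pairwise differences. This sketch is not the calculation behind the figure. It assumes a single population with T ~ Exp(1), so that E[exp(-rT)] = 1 / (1 + r), and it assumes a per-site rate of 4θ / (3n) for a locus of n sites, a scaling chosen here only so that both models agree in the long-locus limit. The function names are hypothetical.

```python
def expected_diff_infinite_sites(theta):
    """E[number of differences] = theta * E[T], with E[T] = 1 assumed."""
    return theta


def expected_diff_jc(theta, n_sites):
    """E[number of differences] under the JC model for a locus of n_sites sites."""
    # ASSUMPTION: per-site rate scaled as 4 * theta / (3 * n_sites).
    site_rate = 4.0 * theta / (3.0 * n_sites)
    # P(difference at a site) = 0.75 * (1 - E[exp(-site_rate * T)]), T ~ Exp(1).
    p_diff = 0.75 * site_rate / (1.0 + site_rate)
    return n_sites * p_diff


def jc_shortfall(theta, lengths=(10, 20, 80)):
    """Gap between the infinite sites and JC expectations for each locus length."""
    target = expected_diff_infinite_sites(theta)
    return {n: target - expected_diff_jc(theta, n) for n in lengths}
```

Under these assumptions the gap shrinks steadily as the locus length grows from 10 to 20 to 80 sites, which is the qualitative pattern the slide describes.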
27. Conclusion
• Theoretical results for the pairwise nucleotide differences under the IIM and isolation
models with the Jukes and Cantor mutation model
• The MLEs for the fruit fly data differ from the results obtained using the
infinite sites model, but this might not be due to the modelling difference
• The approximation results indicate that once the locus length reaches about
80 sites, the infinite sites model is a good approximation to the JC model.
If only data with shorter loci (shorter than 80 sites) are available, we need to
use the exact model (JC model)
28. Questions: Suggestions for future work
• Analysing longer sequences (especially between 10 and 80
sites) with more mathematically explicit packages, for
example Maple or Mathematica.
29. Questions: Fruit Flies Data
[Figure: tree rooted at the MRCA with tips D.sim1&2, D.mel and D.yakuba]
Triplets of sequences:
D.sim1   30,000
D.sim2   30,000
D.mel    30,000
35. Questions: JC model
P(two genes having the same base at that site given T)
= P(no mutations at that site since MRCA)
+ P(at least one mutation at that site and most recent mutation was
to the same base as the other lineage has at that site)
Recall that the mutation follows a Poisson process with rate θ
and so
P(two genes having different bases at that site given T )
= 1 - P(two genes having the same base at that site given T)
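The two probabilities above can be written in closed form. This is a sketch under one assumption: λ denotes the per-site mutation rate for the pair of lineages, a symbol introduced here because the slide does not state how θ is divided among sites. After at least one mutation the base is a uniform draw from the four bases, so it matches the other lineage with probability 1/4.

```latex
\begin{align*}
P(\text{same base} \mid T)
  &= e^{-\lambda T} + \tfrac{1}{4}\left(1 - e^{-\lambda T}\right), \\
P(\text{different bases} \mid T)
  &= 1 - P(\text{same base} \mid T)
   = \tfrac{3}{4}\left(1 - e^{-\lambda T}\right).
\end{align*}
```

As T grows, the probability of a difference tends to 3/4, so a site carries less and less information about long coalescent times.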
Editor's Notes
, which can be represented by a tree. .
Darwinism as modified by the findings of modern genetics
We would not know what kind of demographic scenario the populations might have: IIM, isolation, or any others that we have not considered.
Nor would we know the values of the parameters.
where T needs to be simulated by considering both the coalescent framework and the population structure.