Estimating Speciation Parameters
Using DNA Sequence Data
Name: Daitong Li
Supervisor: Dr H Herbots
Module: STAT3901
Darwinism: Questions Unsolved
New species arise and develop through
natural selection, which can occur
only if individuals vary.
But HOW exactly new species arise
remains a pressing question...
Outline

Coalescent Approach

Genetic Variation Data

Mutation Model

Speciation Model

Maximum Likelihood Estimation

Fruit Flies Data

Approximation

Conclusion
Coalescent Approach: Assumptions
1. Constant population size, non-overlapping generations
2. No selection
3. No recombination
→ exponentially distributed coalescence time
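As a minimal sketch of the last point (not taken from the slides): assuming time is scaled so that a pair of lineages coalesces at rate 1, the coalescence time of two lineages is Exponential(1). The function name and the seed are illustrative choices.

```python
import random


def sample_coalescence_times(num_pairs, seed=1):
    """Draw coalescence times for independent pairs of lineages.

    Assumption: time is scaled so that two lineages coalesce at rate 1,
    so each coalescence time is Exponential(1).
    """
    rng = random.Random(seed)
    return [rng.expovariate(1.0) for _ in range(num_pairs)]


times = sample_coalescence_times(100_000)
mean_time = sum(times) / len(times)
print(f"mean coalescence time = {mean_time:.3f}")  # close to 1 in expectation
```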
Genetic Variation Data
Locus: the position of a gene on a chromosome.
Sequence: the exact order of the bases A, T, C and G in a piece of DNA.
[Figure: sequence comparison highlighting A versus T at one site. 1 pairwise difference found!]
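Counting pairwise differences between two aligned sequences from the same locus can be sketched as follows. The function name and the example sequences are made up for illustration.

```python
def pairwise_differences(seq_a, seq_b):
    """Count the sites at which two aligned sequences carry different bases."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    return sum(1 for base_a, base_b in zip(seq_a, seq_b) if base_a != base_b)


# Two hypothetical sequences that differ only at the last site (A vs T).
print(pairwise_differences("ACGTA", "ACGTT"))  # prints 1
```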
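Mutation Model: Independence
θ: mutation rate per pair of lineages
• Mutations occur as a Poisson process of rate θ per pair of lineages
• E[number of mutations since MRCA] = θ E[coalescence time]
[Figure: genealogy of two lineages, labelled with the coalescence time and the MRCA (most recent common ancestor)]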
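The expectation above can be checked by simulation. This is a sketch under the assumption that the coalescence time T is Exponential(1), so that E[number of mutations] = θ E[T] = θ; the Poisson process is simulated by accumulating exponential waiting times, and all names are illustrative.

```python
import random


def mutations_since_mrca(theta, rng):
    """Number of mutations separating two lineages since their MRCA.

    Assumptions: coalescence time T ~ Exponential(1); mutations arrive as a
    Poisson process of rate theta per pair of lineages during that time.
    """
    coalescence_time = rng.expovariate(1.0)
    count = 0
    elapsed = rng.expovariate(theta)  # waiting time to the first mutation
    while elapsed <= coalescence_time:
        count += 1
        elapsed += rng.expovariate(theta)
    return count


rng = random.Random(1)
theta = 2.0
samples = [mutations_since_mrca(theta, rng) for _ in range(100_000)]
mean_mutations = sum(samples) / len(samples)
print(f"mean number of mutations = {mean_mutations:.3f} (theta * E[T] = {theta})")
```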
Mutation Model: Infinite sites model
• Infinite number of sites on each sequence
• Only a small number of such sequences
• Every mutation is assumed to happen at a different site
• Number of mutations = number of differences
However, real sequences are finite, and we now have access to a large
number of such sequences.
Mutation Model: Finite sites model
Number of mutations ≥ number of differences
Mutation Model: JC (Jukes and Cantor) model
• Mutation can happen at the same site more than once
• Each mutation results in any of the four possible bases
(A, G, C, T) with equal probability 0.25
• The number of pairwise differences given the coalescence time has
a Binomial distribution
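A minimal sketch of the Binomial statement above. It assumes the standard Jukes and Cantor form for the per-site difference probability, with θ interpreted as the rate of base-changing mutations per locus of n sites per pair of lineages; the slides' own scaling is not shown, so this parameterisation and the function names are assumptions.

```python
import math


def jc_diff_prob(theta, t, n_sites):
    """Probability that two sequences differ at one site, given coalescence time t.

    Assumption: standard Jukes-Cantor form, with theta / n_sites the per-site
    rate of base-changing mutations per pair of lineages.
    """
    return 0.75 * (1.0 - math.exp(-4.0 * theta * t / (3.0 * n_sites)))


def pairwise_diff_pmf(k, theta, t, n_sites):
    """P(K = k | T = t): Binomial(n_sites, jc_diff_prob), sites being independent."""
    p = jc_diff_prob(theta, t, n_sites)
    return math.comb(n_sites, k) * p**k * (1.0 - p) ** (n_sites - k)


probs = [pairwise_diff_pmf(k, theta=2.0, t=1.0, n_sites=10) for k in range(11)]
for k, prob in enumerate(probs):
    print(k, f"{prob:.4f}")
```

Under this scaling the expected number of differences, n times the per-site probability, tends to θt as the number of sites grows, which is the infinite sites value.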
Speciation Model: Migration
Speciation Model: Isolation Model
[Figure labels: isolation time; sizes of descendant populations 1 and 2; size of ancestral population]
Speciation Model: IIM (isolation with initial migration) Model
[Figure labels: isolation time; migration rate; migration starting point]
Maximum Likelihood Estimation
IIM Model (7 parameters)
  θ    mutation rate
  M    migration rate
  a    ancestral population size
  c1   size of descendant population 1
  c2   size of descendant population 2
  τ0   migration starting point
  τ1   isolation time
Isolation Model (4 parameters)
  θ    mutation rate
  a    ancestral population size
  c2   size of descendant population 2
  τ    isolation time
Maximum Likelihood Estimation
• Simulate pairwise difference data based on the Binomial distribution.
• For both the IIM and isolation models, simulate equally large numbers of
loci where the two genes come from the same descendant population (1 or 2)
and where they come from different descendant populations.
• Maximise the joint log-likelihood.
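The simulate-then-maximise procedure can be illustrated on a deliberately simplified case. The sketch below is not the IIM or isolation likelihood from the slides: it assumes a single panmictic population (T ~ Exponential(1)), the standard Jukes and Cantor per-site probability, one unknown parameter θ, a midpoint rule to integrate T out, and a grid search in place of a numerical optimiser. All function names are hypothetical.

```python
import math
import random


def jc_diff_prob_from_u(theta, u, n_sites):
    # With u = exp(-T), exp(-4*theta*T/(3n)) equals u ** (4*theta/(3n)).
    return 0.75 * (1.0 - u ** (4.0 * theta / (3.0 * n_sites)))


def marginal_pmf(theta, n_sites, grid=400):
    """P(K = k), k = 0..n_sites, with T ~ Exp(1) integrated out.

    The substitution u = exp(-T) turns the integral over T into an integral
    over u in (0, 1), approximated here by a midpoint rule.
    """
    pmf = [0.0] * (n_sites + 1)
    for i in range(grid):
        p = jc_diff_prob_from_u(theta, (i + 0.5) / grid, n_sites)
        for k in range(n_sites + 1):
            pmf[k] += math.comb(n_sites, k) * p**k * (1.0 - p) ** (n_sites - k) / grid
    return pmf


def simulate_differences(theta, n_sites, num_loci, seed=1):
    """Simulate one pairwise-difference count per locus (Binomial given T)."""
    rng = random.Random(seed)
    data = []
    for _ in range(num_loci):
        u = math.exp(-rng.expovariate(1.0))
        p = jc_diff_prob_from_u(theta, u, n_sites)
        data.append(sum(rng.random() < p for _ in range(n_sites)))
    return data


def log_likelihood(theta, counts, n_sites):
    pmf = marginal_pmf(theta, n_sites)
    return sum(c * math.log(pmf[k]) for k, c in enumerate(counts) if c > 0)


def fit_theta(data, n_sites):
    """Grid-search maximum likelihood estimate of theta over 0.10 .. 6.00."""
    counts = [0] * (n_sites + 1)
    for k in data:
        counts[k] += 1
    candidates = [0.05 * i for i in range(2, 121)]
    return max(candidates, key=lambda th: log_likelihood(th, counts, n_sites))


data = simulate_differences(theta=2.0, n_sites=10, num_loci=5000)
theta_hat = fit_theta(data, n_sites=10)
print(f"true theta = 2.0, estimated theta = {theta_hat:.2f}")
```

For the models in the slides, the same pattern would apply with the appropriate coalescence time distribution for each type of locus and a joint log-likelihood summed over the locus types.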
Fruit Flies Data

We studied a fruit fly data set, originally analysed by
Wang and Hey (2010), by fitting both the IIM and isolation models.

We trimmed the original sequences to 10 sites each.

We arranged the original data so that we have about 10,000 loci
where two genes are sampled from the same population 1 (D. sim),
and 20,000 loci where two genes are sampled from different
populations 1 and 2 (D. sim and D. mel).

We do not have data where both genes are from
population 2, so we cannot estimate c2.
Fruit Flies Data: IIM model
                                 θ      τ1     τ0     c1     a      M
Infinite sites (full sequence)   2.68   2.69   6.20   2.50   1.20   0.089
JC (last 10 sites)               1.15   1.13   4.08   1.21   1.98   0.098
Fruit Flies Data: Isolation model
                                 θ      τ      a
Infinite sites (full sequence)   2.68   5.10   1.75
JC (last 10 sites)               1.35   2.75   1.99
Fruit Flies Data: Model Comparison
Maximum log-likelihood     first 10 sites   second 10 sites   last 10 sites
IIM                        -32731.8         -49147.34         -32707.25
Isolation                  -32741.1         -49201.92         -32719.78
2 ln(likelihood ratio)     18.6             109.16            25.06
• At the 1% significance level, the critical value is χ²₃ = 11.34. Every test statistic above exceeds it, so we have sufficient evidence to believe that
the IIM model fits the data significantly better than the isolation model.
• The AICs for all 3 sets of 10 sites agree with the conclusion above.
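The comparison can be reproduced from the reported maximum log-likelihoods. In this sketch the parameter counts (6 for IIM, 3 for isolation) are read off the fitted-parameter tables, where c2 is not estimated, and are an assumption made here; their difference matches the 3 degrees of freedom of the test. The critical value 11.34 is the one quoted on the slide.

```python
# Maximum log-likelihoods as reported on the slide.
log_lik = {
    "first 10 sites": {"IIM": -32731.8, "Isolation": -32741.1},
    "second 10 sites": {"IIM": -49147.34, "Isolation": -49201.92},
    "last 10 sites": {"IIM": -32707.25, "Isolation": -32719.78},
}
NUM_PARAMS = {"IIM": 6, "Isolation": 3}  # assumed from the fitted tables
CHI2_3_CRITICAL_1PCT = 11.34  # chi-squared, 3 d.f., 1% level (from the slide)


def lrt_statistic(ll_full, ll_reduced):
    """Likelihood ratio test statistic 2 * (lnL_full - lnL_reduced)."""
    return 2.0 * (ll_full - ll_reduced)


def aic(ll, num_params):
    """Akaike information criterion: 2k - 2 lnL (lower is better)."""
    return 2.0 * num_params - 2.0 * ll


results = {}
for subset, ll in log_lik.items():
    stat = lrt_statistic(ll["IIM"], ll["Isolation"])
    aic_iim = aic(ll["IIM"], NUM_PARAMS["IIM"])
    aic_iso = aic(ll["Isolation"], NUM_PARAMS["Isolation"])
    results[subset] = (stat, aic_iim, aic_iso)
    print(
        f"{subset}: 2 ln(LR) = {stat:.2f}, "
        f"reject isolation at 1%: {stat > CHI2_3_CRITICAL_1PCT}, "
        f"AIC prefers IIM: {aic_iim < aic_iso}"
    )
```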
Fruit Flies Data
What is the main reason that our results differ from the
results obtained using the infinite sites model?
• Difference in modelling
• Loss of information from trimming the data to 10 sites
Will the results differ if we use 20 sites, 30 sites, 100 sites,
etc.?
Approximation
red: 10 sites
green: 20 sites
blue: 80 sites
black: infinite sites
When the length of the locus exceeds 80 sites, the infinite
sites model acts as a good approximation
to the Jukes and Cantor model.
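The convergence can be illustrated in closed form for a much simpler setting than the one plotted. Assuming a single population with T ~ Exponential(1) and the standard Jukes and Cantor per-site probability 3/4 (1 - exp(-4θT/(3n))), the mean number of pairwise differences is 3nθ / (3n + 4θ), against θ under infinite sites. This is only a sketch: the figure is based on the speciation models, so its numbers will not match, and θ = 6 is an arbitrary illustrative value.

```python
def mean_differences_jc(theta, n_sites):
    """E[K] under Jukes-Cantor with n_sites sites and T ~ Exp(1).

    Uses E[exp(-a*T)] = 1 / (1 + a) with a = 4*theta / (3*n_sites), so
    E[K] = 0.75 * n * a / (1 + a) = 3*n*theta / (3*n + 4*theta).
    """
    return 3.0 * n_sites * theta / (3.0 * n_sites + 4.0 * theta)


def mean_differences_infinite_sites(theta):
    """E[K] under infinite sites with T ~ Exp(1): theta * E[T] = theta."""
    return theta


theta = 6.0  # arbitrary illustrative value
for n_sites in (10, 20, 80, 1000):
    print(f"{n_sites} sites: mean = {mean_differences_jc(theta, n_sites):.3f}")
print(f"infinite sites: mean = {mean_differences_infinite_sites(theta):.3f}")
```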
Conclusion
• Theoretical results for the number of pairwise nucleotide differences under the IIM and isolation
models with the Jukes and Cantor mutation model
• MLEs for the fruit fly data seem different from the results obtained using the
infinite sites model, but this might not be due to the difference in modelling
• Approximation results indicate that when the length of the locus reaches
80 sites, the infinite sites model can be a good approximation to the JC model.

If only data with shorter loci (shorter than 80 sites) are available, we need to
use the exact model (JC model).
Questions: Suggestions for future work
• Analysing longer sequences (especially between 10 and 80
sites) with more mathematically exact software, for
example Maple or Mathematica.
Questions: Fruit Flies Data
[Figure: tree from the MRCA to D. sim 1 & 2, D. mel and D. yakuba]
Triplets of sequences:
D.sim1   30,000
D.sim2   30,000
D.mel    30,000
Questions: Rounding Error
n=30, k=16, j=1:
(1) b=2: 1.0542e-03
(2) b=3: -4.9174e-03
(1) + (2) < 0 !
Questions: Rounding Error
[1] 1.301120e-01 2.012619e-01 1.301120e-01 3.478349e-02 3.032439e-01
[6] 2.012619e-01 5.954263e-02 3.465417e-02 1.965132e-02 3.032439e-01
[11] 4.441905e-02 2.012619e-01 2.012619e-01 2.012619e-01 2.012619e-01
[16] 3.032439e-01 8.547706e-02 2.012619e-01 2.012619e-01 3.032439e-01
[21] 3.032439e-01 8.547706e-02 3.032439e-01 3.032439e-01 3.032439e-01
[26] 2.012619e-01 -1.460063e+04 2.265334e-02 1.301120e-01 4.441905e-02
[31] 3.032439e-01 3.032439e-01 3.032439e-01 1.301120e-01 1.301120e-01
[36] 3.032439e-01 2.012619e-01 3.465417e-02 3.032439e-01 1.301120e-01
[41] 3.032439e-01 1.301120e-01 5.954263e-02 3.032439e-01 3.032439e-01
[46] 2.788816e-02 3.032439e-01 1.301120e-01 2.012619e-01 3.032439e-01
2 1 2 6 0 1 4 10 9 0 5 1 1 1 1 0 3 1 1 0 0 3 0 0 0 1
16 8 2 5 0 0 0 2 2 0 1 10 0 2 0 2 4 0 0 7 0 2 1 0
Questions: Converted Parameters (IIM)
                                 θ      τ1         τ0         c1      a       M
Infinite sites (full sequence)   2.68   1.58myrs   3.62myrs   7.30m   3.51m   7.59e-09/yr
JC (last 10 sites)               1.15   0.66myrs   2.38myrs   3.53m   5.78m   8.29e-09/yr
Questions: Converted Parameters (Isol.)
                                 θ      τ          a
Infinite sites (full sequence)   2.68   2.97myrs   5.16m
JC (last 10 sites)               1.35   1.61myrs   5.81m
Approximation
red: 10 sites
green: 20 sites
blue: 80 sites
black: infinite sites
                 Min.    1st Qu.   Median   Mean    3rd Qu.   Max.
Infinite sites   0.000   4.000     6.000    6.354   8.000     32.000
10 sites         0.000   3.000     4.000    4.119   5.000     10.000
20 sites         0.00    3.00      5.00     5.06    7.00      16.00
80 sites         0.000   4.000     6.000    6.051   8.000     28.000
Questions: JC model
P(two genes having the same base at that site given T)
= P(no mutations at that site since MRCA)
+ P(at least one mutation at that site and most recent mutation was
to the same base as the other lineage has at that site)
Recall that the mutation follows a Poisson process with rate θ
and so
P(two genes having different bases at that site given T )
= 1 - P(two genes having the same base at that site given T)
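The equations on this slide did not survive extraction. As a hedged reconstruction from the verbal argument above, write λ for the rate of mutation events at one site along the two lineages combined (a symbol introduced here; the slide's own per-site scaling of θ is not recoverable). The number of events at that site in time T is then Poisson with mean λT, and each event draws one of the four bases with probability 1/4:

```latex
\begin{align*}
P(\text{same base} \mid T)
  &= e^{-\lambda T} + \left(1 - e^{-\lambda T}\right)\tfrac{1}{4}, \\
P(\text{different bases} \mid T)
  &= 1 - P(\text{same base} \mid T)
   = \tfrac{3}{4}\left(1 - e^{-\lambda T}\right).
\end{align*}
```

With sites mutating independently, this per-site probability is the success probability of the Binomial distribution for the number of pairwise differences given T.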

Editor's Notes

  1. , which can be represented by a tree. .
  2. Darwinism as modified by the findings of modern genetics
  3. Darwinism as modified by the findings of modern genetics
  4. Darwinism as modified by the findings of modern genetics
  5. Darwinism as modified by the findings of modern genetics
  6. We would not know what kind of demographic scenario the populations might have experienced: IIM, isolation or any other that we have not considered. Nor would we know the values of the parameters.
  7. where T needs to be simulated by considering both coalescent framework and population structure.