This document discusses estimating speciation parameters using DNA sequence data. It outlines using a coalescent approach and genetic variation data to build mutation and speciation models. Maximum likelihood estimation is used to estimate parameters for isolation with migration and isolation models using fruit fly data. Approximations show the Jukes-Cantor model provides a good approximation to the infinite sites model when locus length is over 80 sites. Questions are raised about further analyzing longer sequences and addressing rounding errors.
2. Darwinism: Questions Unsolved
New species arise and develop through natural selection, which can occur only if individuals vary.
But HOW exactly new species arise remains a pressing question...
6. Genetic Variation Data
Locus: the position of a gene on a chromosome.
Sequence: the exact order of the bases A, T, C and G in a piece of DNA.
[Figure: two aligned sequences compared site by site; 1 pairwise difference found]
8. Mutation Model: Independence
θ: mutation rate per 2 lineages
• Mutations ~ Poisson process of rate θ per 2 lineages
• E[number of mutations since MRCA] = θ E[coalescent time]
[Figure: genealogy of two lineages, with the coalescent time measured back to the MRCA]
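The identity E[number of mutations since MRCA] = θ E[coalescent time] can be checked with a short simulation. This is only a sketch: it assumes the coalescent time T of a pair of lineages is exponential with mean 1 (the standard single-population case, which the slide does not state), and the function names are illustrative.

```python
import random

def sample_poisson(rate, rng):
    """Draw from Poisson(rate) by counting unit-rate arrivals before time `rate`."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < rate:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def mean_mutations(theta, n_pairs=100_000, seed=1):
    """Return (average simulated mutation count, theta * average coalescent time)."""
    rng = random.Random(seed)
    total_mutations = 0
    total_time = 0.0
    for _ in range(n_pairs):
        t = rng.expovariate(1.0)  # ASSUMPTION: T ~ Exp(1), not taken from the slides
        total_mutations += sample_poisson(theta * t, rng)  # mutations given T
        total_time += t
    return total_mutations / n_pairs, theta * total_time / n_pairs

observed, expected = mean_mutations(theta=2.0)
print(observed, expected)
```

The two printed numbers should agree up to simulation noise; under a structured model such as IIM the distribution of T changes, but the same conditioning argument applies.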
9. Mutation Model: Infinite sites model
• Infinite number of sites on each sequence
• Only a small number of such sequences
• Each mutation is assumed to happen at a different site
• Number of mutations = Number of differences
However, real sequences are finite, and we now have access to a large
number of them.
11. Mutation Model: JC model
• Mutation can happen at the same site more than once
• Each base can mutate to any of the four possible bases
(A, G, C, T) with equal probability 0.25
• The number of pairwise differences given the coalescent time has
a Binomial distribution
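A minimal sketch of the Binomial distribution described above. It assumes mutation events at each site follow a Poisson process with a per-site rate `site_rate` (a hypothetical name; the slides do not say how this rate relates to θ) and that sites evolve independently.

```python
import math

def jc_diff_prob(site_rate, t):
    """P(two genes differ at one site | coalescent time t).

    With probability exp(-site_rate * t) no mutation hit the site; otherwise
    the most recent mutation picked one of the four bases uniformly, so the
    two genes match with probability 1/4.
    """
    return 0.75 * (1.0 - math.exp(-site_rate * t))

def pairwise_diff_pmf(k, n_sites, site_rate, t):
    """P(k pairwise differences among n_sites independent sites | t)."""
    p = jc_diff_prob(site_rate, t)
    return math.comb(n_sites, k) * p**k * (1.0 - p) ** (n_sites - k)

# Distribution of the number of differences for a 10-site locus at t = 1.5
pmf = [pairwise_diff_pmf(k, 10, 0.2, 1.5) for k in range(11)]
print(pmf)
```

Note that the per-site difference probability saturates at 0.75 for large coalescent times, which is the feature the infinite sites model cannot capture.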
16. Outline
Coalescent Approach
Genetic Variation Data
Mutation Model
Speciation Model
Maximum Likelihood Estimation
Fruit Flies Data
Approximation
Conclusion
17. Maximum Likelihood Estimation
IIM Model (7)
Mutation Rate θ
Migration rate M
Ancestral population a
Descendant population 1 c1
Descendant population 2 c2
Migration starting point τ0
Isolation time τ1
Isolation Model (4)
Mutation Rate θ
Ancestral population a
Descendant population 2 c2
Isolation time τ
18. Maximum Likelihood Estimation
• Simulate pairwise difference data based on the Binomial distribution
• For both the IIM and the isolation model, simulate an equally large number
of loci where the two genes are from the same descendant population (1 or 2)
and where they are from different descendant populations.
• Maximise the joint log-likelihood.
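The simulate-then-maximise procedure can be illustrated on a deliberately simplified case. The sketch below is not the IIM or isolation likelihood: it assumes a single population with T ~ Exp(1), estimates only θ, takes the per-site rate to be 4θ/(3 × number of sites), and uses a plain grid search. All names and these modelling choices are assumptions made for illustration.

```python
import math
import random
from collections import Counter

N_SITES = 10  # locus length, matching the 10-site loci used in the talk
COMBS = [math.comb(N_SITES, k) for k in range(N_SITES + 1)]

def site_diff_prob(theta, u):
    """Per-site difference probability, written in terms of u = exp(-T)."""
    lam = 4.0 * theta / (3.0 * N_SITES)  # ASSUMPTION: per-site rate
    return 0.75 * (1.0 - u**lam)

def pmf_table(theta, grid=400):
    """P(K = k), k = 0..N_SITES, with T ~ Exp(1) integrated out (midpoint rule in u)."""
    probs = [0.0] * (N_SITES + 1)
    for j in range(grid):
        p = site_diff_prob(theta, (j + 0.5) / grid)
        for k in range(N_SITES + 1):
            probs[k] += COMBS[k] * p**k * (1.0 - p) ** (N_SITES - k) / grid
    return probs

def simulate(theta, n_loci, rng):
    """Simulate pairwise difference counts, one per locus."""
    counts = Counter()
    for _ in range(n_loci):
        u = math.exp(-rng.expovariate(1.0))  # ASSUMPTION: T ~ Exp(1)
        p = site_diff_prob(theta, u)
        counts[sum(rng.random() < p for _ in range(N_SITES))] += 1
    return counts

def log_likelihood(theta, counts):
    """Joint log-likelihood of independent loci."""
    probs = pmf_table(theta)
    return sum(c * math.log(probs[k]) for k, c in counts.items())

def fit_theta(counts):
    """Grid-search maximiser of the log-likelihood over theta in [0.5, 4.0]."""
    candidates = [0.5 + 0.05 * i for i in range(71)]
    return max(candidates, key=lambda th: log_likelihood(th, counts))

data = simulate(theta=2.0, n_loci=5000, rng=random.Random(0))
print(fit_theta(data))
```

In the full analysis the log-likelihood would sum over loci sampled within and between populations, and a numerical optimiser over all parameters would replace the one-dimensional grid.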
20. Fruit Flies Data
We studied a fruit fly data set, originally analysed by Wang and Hey (2010),
by fitting both the IIM and the isolation model.
We trim the original sequences to 10 sites.
We manipulate the original data so that we have about 10,000 loci
where two genes are sampled from the same population 1 (D. sim),
and 20,000 loci where two genes are sampled from different
populations 1 and 2 (D. sim and D. mel).
We do not have data where two genes are from the same
population 2, so we cannot estimate c2.
21. Fruit Flies Data: IIM model
                                 θ      τ1     τ0     c1     a      M
Infinite sites (full sequence)   2.68   2.69   6.20   2.50   1.20   0.089
JC (last 10 sites)               1.15   1.13   4.08   1.21   1.98   0.098
22. Fruit Flies Data: Isolation model
                                 θ      τ      a
Infinite sites (full sequence)   2.68   5.10   1.75
JC (last 10 sites)               1.35   2.75   1.99
23. Fruit Flies Data: Model Comparison
                           first 10 sites   second 10 sites   last 10 sites
Maximum log-likelihood
  IIM                      -32731.8         -49147.34         -32707.25
  Isolation                -32741.1         -49201.92         -32719.78
2 ln(likelihood ratio)     18.6             109.16            25.06
• At the 1% significance level, the χ² critical value with 3 degrees of freedom is
11.34, so we have sufficient evidence to believe that the IIM model fits the data
significantly better than the isolation model.
• The AICs for all 3 sets of 10 sites agree with the conclusion above.
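The likelihood ratio statistics in the table can be recomputed from the reported maximum log-likelihoods. The sketch below uses the closed-form upper tail of the χ² distribution with 3 degrees of freedom; for the AIC it takes the parameter counts to be the number of estimates reported in the IIM and isolation tables (6 and 3), which is an assumption about how the AICs were computed.

```python
import math

def lr_statistic(loglik_full, loglik_nested):
    """2 ln(likelihood ratio) for a nested pair of models."""
    return 2.0 * (loglik_full - loglik_nested)

def chi2_sf_3df(x):
    """Upper-tail probability of a chi-squared variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

def aic(loglik, n_params):
    """Akaike information criterion; lower is better."""
    return 2.0 * n_params - 2.0 * loglik

N_PARAMS_IIM, N_PARAMS_ISO = 6, 3  # ASSUMPTION: counts of reported estimates

# (IIM, isolation) maximum log-likelihoods from the table
fits = {
    "first 10 sites": (-32731.8, -32741.1),
    "second 10 sites": (-49147.34, -49201.92),
    "last 10 sites": (-32707.25, -32719.78),
}
for label, (ll_iim, ll_iso) in fits.items():
    stat = lr_statistic(ll_iim, ll_iso)
    prefers_iim = aic(ll_iim, N_PARAMS_IIM) < aic(ll_iso, N_PARAMS_ISO)
    print(label, round(stat, 2), chi2_sf_3df(stat), prefers_iim)
```

Because the two models differ by 3 parameters, the AIC prefers the IIM model exactly when the likelihood ratio statistic exceeds 6.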
24. Fruit Flies Data
What is the main reason that our results differ from the
results obtained using the infinite sites model?
• Difference in modelling
• Loss of information from trimming the data to 10 sites
Will the results differ if we use 20 sites, 30 sites, 100 sites,
etc.?
26. Approximation
[Figure legend]
red: 10 sites
green: 20 sites
blue: 80 sites
black: infinite sites
When the length of the locus gets larger than 80, the infinite sites model
acts as a good approximation to the Jukes and Cantor model.
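The convergence described above can be explored numerically in a toy setting. The sketch below does not reproduce the figure: it assumes a single population with T ~ Exp(1), a per-site rate of 4θ/(3 × number of sites), and measures closeness by total variation distance between the two distributions of pairwise differences. These choices and all names are assumptions.

```python
import math

def jc_pmf(theta, n_sites, grid=1000):
    """P(K = k), k = 0..n_sites, under JC with T ~ Exp(1) integrated out."""
    lam = 4.0 * theta / (3.0 * n_sites)  # ASSUMPTION: per-site rate
    combs = [math.comb(n_sites, k) for k in range(n_sites + 1)]
    probs = [0.0] * (n_sites + 1)
    for j in range(grid):
        u = (j + 0.5) / grid  # midpoint rule in u = exp(-T)
        p = 0.75 * (1.0 - u**lam)
        for k in range(n_sites + 1):
            probs[k] += combs[k] * p**k * (1.0 - p) ** (n_sites - k) / grid
    return probs

def infinite_sites_pmf(theta, k):
    """P(K = k) when K | T ~ Poisson(theta * T) and T ~ Exp(1): a geometric law."""
    return theta**k / (1.0 + theta) ** (k + 1)

def total_variation(theta, n_sites):
    """Total variation distance between the JC and infinite sites distributions."""
    jc = jc_pmf(theta, n_sites)
    body = sum(abs(jc[k] - infinite_sites_pmf(theta, k)) for k in range(n_sites + 1))
    tail = (theta / (1.0 + theta)) ** (n_sites + 1)  # infinite sites mass above n_sites
    return 0.5 * (body + tail)

for n in (10, 20, 80):
    print(n, total_variation(2.0, n))
```

The distance shrinks as the locus gets longer, in line with the qualitative message of the slide; how small is small enough depends on the parameter values and on the demographic model.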
27. Conclusion
• Theoretical results for the pairwise nucleotide differences under the IIM and
isolation models with the Jukes and Cantor mutation model
• MLEs for the fruit fly data seem different from the results obtained using the
infinite sites model, but this might not be due to the difference in modelling
• Approximation results indicate that when the length of the locus goes up to
80 sites, the infinite sites model can be a good approximation of the JC model.
If only data with shorter loci (shorter than 80 sites) are available, we need to
use the exact model (JC model)
28. Questions: Suggestions for future work
• Analysing longer sequences (especially between 10 and 80
sites) with more mathematically explicit packages, for
example Maple or Mathematica.
29. Questions: Fruit Flies Data
[Figure: tree from the MRCA to D. sim 1 & 2, D. mel and D. yakuba]
Triplets of sequences:
D. sim 1   30,000
D. sim 2   30,000
D. mel     30,000
35. Questions: JC model
P(two genes having the same base at that site given T)
= P(no mutations at that site since MRCA)
+ P(at least one mutation at that site and most recent mutation was
to the same base as the other lineage has at that site)
Recall that the mutation follows a Poisson process with rate θ
and so
P(two genes having different bases at that site given T )
= 1 - P(two genes having the same base at that site given T)
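Written out, the argument above gives the following expressions. This is a sketch in which λ denotes the Poisson rate of mutation events at a single site over both lineages; the symbol λ and its exact relation to θ are assumptions, since the slide's own formula did not survive.

```latex
\begin{align*}
P(\text{same base at a site} \mid T)
  &= e^{-\lambda T} + \tfrac{1}{4}\bigl(1 - e^{-\lambda T}\bigr), \\
P(\text{different bases at a site} \mid T)
  &= 1 - P(\text{same base at a site} \mid T)
   = \tfrac{3}{4}\bigl(1 - e^{-\lambda T}\bigr).
\end{align*}
```

The first term is the probability of no mutation at the site since the MRCA; the factor 1/4 is the chance that the most recent mutation lands on the base carried by the other lineage.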
Editor's Notes
, which can be represented by a tree.
Darwinism as modified by the findings of modern genetics
We would not know what kind of demographic scenario the populations might have followed: IIM, isolation, or any other that we have not considered.
Nor would we know the values of the parameters.
where T needs to be simulated by considering both the coalescent framework and the population structure.