Phylogenetic models and MCMC methods for the reconstruction of language history

  1. 1. Phylogenetic models and MCMC methods for the reconstruction of language history. Robin J. Ryder (CEREMADE, Paris Dauphine / CREST, INSEE). Joint work with Geoff K. Nicholls, Department of Statistics, University of Oxford. www.slideshare.net/robinryder
  2. 2. "Carles li reis, nostre emper[er]e magnes Set anz tuz pleins ad estet en Espaigne : Tresqu’en la mer cunquist la tere altaigne. N’i ad castel ki devant lui remaigne ; Mur ne citet n’i est remes a fraindre, Fors Sarraguce, ki est en une muntaigne." Chanson de Roland, 1r (11th century)
  3. 3. "La plus commune façon d'amollir les coeurs de ceux qu'on a offensez, lors qu'ayant la vengeance en main, ils nous tiennent à leur mercy, c'est de les esmouvoir par submission à commiseration et à pitié." Montaigne, Essais, I, 1 (1580)
  4. 4. "Tes yeux sont si profonds qu'en me penchant pour boire J'ai vu tous les soleils y venir se mirer S'y jeter à mourir tous les désespérés Tes yeux sont si profonds que j'y perds la mémoire" Aragon, Les Yeux d'Elsa (1942)
  5. 5. "Et la piaule swingue au son du ghetto, on tape à la porte Chill c'est trop fort ! baisse le son merde ! j'connais A chaque fois c'est pareil tant pis il faut qu'ça pète Et profite en traître des nouveaux albums qu'Rod m'achète" Akhénaton, Juste une pression (2005)
  6. 6. What to expect
      - Description of the data
      - Model of language diversification
      - MCMC for phylogenetic trees
      - Synthetic studies
      - Analysis of two data sets
  11. 11. Indo-European languages
  13. 13. Language diversification
      - Languages change in a way comparable to biological species.
      - Similarities between languages indicate that they may be related ("cousins").
      - Most common model: a phylogenetic tree.
  14. 15. Questions
      - Topology
      - Internal ages
      - Age of the root: 6000-6500 BP or 8000-9500 BP? (BP = Before Present)
  18. 19. Core vocabulary
      - 100 or 200 meanings, present in almost all languages: bird, hand, to eat, red...
      - Borrowing is possible (non-tree-like change), but it is:
          - "Easy" to detect
          - Uncommon
          - Not a source of systematic bias
  23. 24. Data coding
      - Old English: stierfþ
      - Old High German: stirbit, touwit
      - Avestan: miriiete
      - Old Church Slavonic: umĭretŭ
      - Latin: moritur
      - Oscan: ?
      Cognacy classes:
      1. {stierfþ, stirbit}
      2. {touwit}
      3. {miriiete, umĭretŭ, moritur}
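The coding step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the function name `code_language`, and the choice to code Oscan (listed as "?") as missing in every column are hypothetical and not taken from the authors' software.

```python
# Hypothetical illustration of the data coding; names and layout are assumptions.
words = {
    "Old English": ["stierfþ"],
    "Old High German": ["stirbit", "touwit"],
    "Avestan": ["miriiete"],
    "Old Church Slavonic": ["umĭretŭ"],
    "Latin": ["moritur"],
    "Oscan": None,  # listed as "?" on the slide: treated here as missing data
}

cognacy_classes = [
    {"stierfþ", "stirbit"},
    {"touwit"},
    {"miriiete", "umĭretŭ", "moritur"},
]


def code_language(forms, classes):
    """Return one entry per cognacy class: 1 if present, 0 if absent, '?' if unknown."""
    if forms is None:
        return ["?"] * len(classes)
    return [1 if any(form in cls for form in forms) else 0 for cls in classes]


matrix = {lang: code_language(forms, cognacy_classes) for lang, forms in words.items()}

for lang, row in matrix.items():
    print(f"{lang:22s}", *row)
```

Each column of the resulting matrix is one trait (a cognacy class), which is the unit that is born and dies in the model described next.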
  24. 25. Constraints
      - Constraints on parts of the topology
      - Constraints on some internal ages
      - We use these constraints to infer rates and other ages
  27. 29. Description of the model (1)
      - Traits are born at rate λ
      - Trait instances die at rate μ
      - λ and μ are constants
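As a rough sketch of what these three bullets imply on a single lineage, the trait count can be simulated with a standard Gillespie-style scheme: new traits arrive at total rate λ and each existing trait dies independently at rate μ. The function name, the parameter values, and the single-lineage restriction are assumptions made for illustration; the model in the talk runs on a whole tree.

```python
import random


def simulate_trait_count(lam, mu, duration, n_start=0, rng=None):
    """Simulate the number of traits on one lineage after `duration` time units.

    Births occur at total rate `lam`; each of the n current traits dies at rate `mu`.
    """
    if rng is None:
        rng = random.Random()
    t, n = 0.0, n_start
    while True:
        total_rate = lam + mu * n
        t += rng.expovariate(total_rate)  # waiting time to the next event
        if t > duration:
            return n
        if rng.random() < lam / total_rate:
            n += 1  # a new trait is born
        else:
            n -= 1  # one existing trait dies


# Illustrative values only (not estimates from the talk).
rng = random.Random(1)
counts = [simulate_trait_count(lam=5.0, mu=0.1, duration=100.0, rng=rng) for _ in range(200)]
print(sum(counts) / len(counts))  # should be close to lam / mu = 50
```

With constant rates the count settles around λ/μ, which is the quantity that reappears in the catastrophe condition on the next slide.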
  30. 32. Description of the model (2)
      - Catastrophes occur at rate ρ
      - At a catastrophe, each trait dies with probability κ and Poisson(ν) traits are born
      - λ/μ = ν/κ: the number of traits is constant on average
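One way to see why this condition keeps the trait count stable is the short calculation below. It is a sketch that assumes the lineage is at the equilibrium of the birth-death process from part (1), where the expected number of traits is λ/μ; the symbol N for the trait count is introduced here for illustration.

```latex
\begin{align*}
\mathbb{E}[N_{\text{before}}] &= \frac{\lambda}{\mu}, \\
\mathbb{E}[N_{\text{after}}] &= (1-\kappa)\,\mathbb{E}[N_{\text{before}}] + \nu, \\
\mathbb{E}[N_{\text{after}}] = \mathbb{E}[N_{\text{before}}]
  &\iff \nu = \kappa\,\frac{\lambda}{\mu}
  \iff \frac{\lambda}{\mu} = \frac{\nu}{\kappa}.
\end{align*}
```

In words: a catastrophe removes a fraction κ of the traits on average and adds ν new ones, so the expected count is unchanged exactly when ν = κλ/μ.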
  33. 35. Description of the model (3)
      - Observation model: each data point (0s and 1s) is missing with probability ξ
      - Some traits are not observed and are therefore deleted from the data
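The observation step can be sketched as follows. Two assumptions are made here and are not stated on the slide: missingness is applied independently to every entry, and a trait counts as "not observed" when no language shows a 1 for it after masking. The function name `observe` and the example matrix are hypothetical.

```python
import random


def observe(matrix, xi, rng):
    """Mask each 0/1 entry as '?' with probability xi, then drop unobserved traits.

    Assumption: a trait (column) is dropped when no language displays a 1 for it.
    """
    masked = [["?" if rng.random() < xi else x for x in row] for row in matrix]
    n_traits = len(matrix[0]) if matrix else 0
    kept = [j for j in range(n_traits) if any(row[j] == 1 for row in masked)]
    return [[row[j] for j in kept] for row in masked]


# Rows are languages, columns are traits; the last trait is absent everywhere.
full = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]
print(observe(full, xi=0.2, rng=random.Random(0)))
```

Because the deleted columns depend on the data, the likelihood has to condition on a trait being registered, which is what the registration-process slides address.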
  35. 37. Registration process
  39. 41. Posterior distribution
  40. 42. Likelihood calculations
  41. 43. Prior distribution on trees
      - Our main focus is on the root age
      - We would like the marginal prior on the root age to be (approximately) uniform over (say) 5000-15000 BP
  43. 45. MCMC moves
      - Random walk on the parameters
      - Various moves on the tree (Drummond et al., 2002)
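For the first bullet, a generic random-walk Metropolis update on a scalar parameter looks roughly like the sketch below. It is not the sampler used in the talk: the target `toy_log_post` is a stand-in (an unnormalised Gamma(3, 2) log-density for a positive rate), and the step size and chain length are arbitrary choices. The tree moves of the second bullet are not shown.

```python
import math
import random


def random_walk_metropolis(log_post, x0, step, n_iter, rng):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    x, lp = x0, log_post(x0)
    chain = []
    for _ in range(n_iter):
        proposal = x + rng.gauss(0.0, step)
        lp_prop = log_post(proposal)
        # Symmetric proposal, so the acceptance probability is min(1, ratio of posteriors).
        accept_prob = math.exp(min(0.0, lp_prop - lp))
        if rng.random() < accept_prob:
            x, lp = proposal, lp_prop
        chain.append(x)
    return chain


def toy_log_post(mu):
    """Stand-in target: unnormalised Gamma(shape=3, rate=2) log-density on mu > 0."""
    if mu <= 0:
        return -math.inf
    return 2.0 * math.log(mu) - 2.0 * mu


chain = random_walk_metropolis(toy_log_post, x0=1.0, step=1.0, n_iter=20000,
                               rng=random.Random(0))
kept = chain[2000:]  # discard a burn-in period
print(sum(kept) / len(kept))  # should be near the Gamma(3, 2) mean of 1.5
```

Proposals that fall outside the support (here, a non-positive rate) get log-posterior minus infinity and are always rejected, which keeps the chain inside the parameter constraints.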
  45. 66. Checking mixing and convergence
      - Auto-correlations
      - Need statistics on the tree:
          - Length of the tree
          - Root age
          - Presence/absence of a few subtrees
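The autocorrelation check can be applied to any scalar summary of the sampled trees, such as the root age. A minimal sketch follows; the trace is a synthetic AR(1) series standing in for real MCMC output, and the function name and numbers are assumptions for illustration.

```python
import random


def autocorrelation(series, max_lag):
    """Sample autocorrelation of a scalar trace for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    var = sum(c * c for c in centered) / n
    return [
        sum(centered[i] * centered[i + k] for i in range(n - k)) / (n * var)
        for k in range(max_lag + 1)
    ]


# Synthetic stand-in for a root-age trace: AR(1) noise around 8000 (illustrative only).
rng = random.Random(0)
trace, x = [], 0.0
for _ in range(5000):
    x = 0.8 * x + rng.gauss(0.0, 1.0)
    trace.append(8000.0 + 100.0 * x)

acf = autocorrelation(trace, max_lag=20)
print([round(r, 2) for r in acf[:5]])
```

A trace whose autocorrelation decays slowly relative to the run length suggests poor mixing; the same function can be run on the tree length or on a 0/1 indicator for the presence of a given subtree.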
  50. 71. Synthetic data (figure: true tree, ~40 words/language, and consensus tree)
  51. 72. Synthetic data (2) (figure: death rate μ)
  52. 73. Influence of borrowing (figure: true tree, ~40 words/language, borrowing 10%, and consensus tree)
  53. 74. Influence of borrowing (2) (figure: true tree, ~40 words/language, borrowing 50%, and consensus tree)
  54. 75. Influence of borrowing (3)
      - Topology is reconstructed correctly
      - Dates are underestimated for high levels of borrowing
      (figure: root age and death rate μ, borrowing 50%)
  56. 77. Detecting borrowing Confirmed: hardly any borrowing!
  57. 78. Data used
      - Indo-European languages
      - Core vocabulary (Swadesh 100 or 200)
      - Two independent data sets:
          - Dyen et al. (1997): 87 languages, mostly modern
          - Ringe et al. (2002): 24 languages, mostly ancient
  62. 83. Constraints
  63. 84. Cross-validation
  64. 97. Root age
  65. 98. Conclusions
      - Strong support for the Anatolian hypothesis: root age around 8000 BP. No support for the Kurgan hypothesis.
      - Applicable to a variety of linguistic and cultural data sets
      - TraitLab: it's free!
  68. 101. Questions otázky spørgsmåler vragen questions Fragen domande pytania questões întrebări вопросы vprašanja preguntes preguntas frågor vrae spurningar quaestiones ερωτήσεις въпроси kesses spørsmåler kláusimai запитанні سوال प्रश्न cwestiwnau