
A phylogenetic model of language diversification


  1. 1. A Phylogenetic Model of Language Diversification. Robin J. Ryder (CEREMADE, Université Paris-Dauphine) and Geoff K. Nicholls (Department of Statistics, University of Oxford). UCLA, March 2013. www.slideshare.net/robinryder
  2. 2. Gray and Atkinson’s tree(s) R. Ryder & G. Nicholls (Dauphine & Oxford) Language phylogenies UCLA 2013 2 / 81
  3. 3. Caveats: I am not a linguist. Statistics offer additional insight alongside the comparative method. I use the word "evolution" in a broad sense. "All models are false, but some are useful."
  4. 4. Advantages of statistical methods: analyse (very) large datasets; test multiple hypotheses; cross-validation; estimate uncertainty.
  5. 5. Questions to answer: topology of the tree; age of ancestor nodes; age of the root: 6000-6500 BP or 8000-9500 BP (Before Present)? 6000 BP corresponds to the Kurgan horsemen hypothesis, 8000 BP to the Anatolian farmers hypothesis.
  6. 6. Statistical method in a nutshell: (1) collect data; (2) design model; (3) perform inference (MCMC, ...); (4) check convergence; (5) in-model validation (is our inference method able to answer questions from our model?); (6) model mis-specification analysis (do we need a more complex model?); (7) conclude.
  7. 7. Outline: (1) Data; (2) Model; (3) Inference; (4) In-model validation; (5) Model mis-specification; (6) Results; (7) Semitic lexical data; (8) Bergsland and Vogt; (9) Punctuational bursts.
  8. 8. Morris Swadesh and glottochronology: 200/100 word list. Compares 2 languages (c = fraction of shared cognates). Assumes that r, the fraction of cognates retained after 1000 years, is constant for all languages (86%). Infers the age t (in millennia) of the Most Recent Common Ancestor: $\hat{t} = \frac{\ln c}{2 \ln r}$.
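The glottochronology estimate on this slide can be checked numerically. The following is a minimal Python sketch, assuming that c and r are given as fractions and that r is a retention rate per 1000 years, so that the result is in millennia. The function name `glottochronology_age` and the example value c = 0.74 are illustrative and do not come from the talk.

```python
import math

def glottochronology_age(c, r=0.86):
    """Swadesh-style estimate of the time (in millennia) since two languages split.

    c: fraction of cognates shared by the two languages (0 < c <= 1)
    r: fraction of cognates retained by one language after 1000 years (0 < r < 1)
    """
    if not 0.0 < c <= 1.0:
        raise ValueError("c must lie in (0, 1]")
    if not 0.0 < r < 1.0:
        raise ValueError("r must lie in (0, 1)")
    # Each lineage retains a fraction r**t after t millennia, so two
    # independent lineages share c = r**(2 t), hence t = ln c / (2 ln r).
    return math.log(c) / (2.0 * math.log(r))

# Illustrative input: two languages sharing 74% of their core vocabulary.
age = glottochronology_age(0.74)
print(f"estimated split: {age:.2f} millennia ago")
```

The estimate relies entirely on r being the same for all languages, which is the assumption that Bergsland and Vogt later challenged.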
  9. 9. Swadesh word list: all dog grass long river split walk and drink green louse road warm animal dry man root squeeze guts ashes rope stab wash dull many hair at dust rotten stand water hand meat back ear round star he moon stick we bad earth rub head bark eat mother salt stone wet hear egg heart sand what because mountain straight eye heavy say belly mouth suck when fall here name sun big scratch where far hit narrow swell bird fat sea hold near swim white bite father see horn neck tail black fear seed who how new ten blood sew hunt night that wide blow feather sharp nose there wife bone few husband short not they breast fight I sing wind old thick fire ice one sit thin wing breathe fish if other skin think burn five in sky wipe child this float kill person sleep thou claw flow knee with play small three cloud flower know pull smell throw cold fly lake woman tie come fog push laugh woods
  10. 10. Bergsland and Vogt (1962): found different rates for different pairs of languages (Old Norse and Icelandic, Georgian and Mingrelian, Armenian and Old Armenian). Discredited glottochronology. Sankoff (1973): sample selection bias, no estimation of uncertainty. Fair criticism. Bad observation protocol from Swadesh. Does not apply (so much) to modern methods.
  12. 12. Core vocabulary: 100 or 200 words, present in almost all languages (bird, hand, to eat, red...). Borrowing can occur (evolution not along a tree), but it is "easy" to detect, it is rare, and it does not bias the results.
  13. 13. Binary data: he dies, three, all
      Language            | he dies         | three  | all
      Old English         | stierfþ         | þrīe   | ealle
      Old High German     | stirbit, touwit | drī    | alle
      Avestan             | miriiete        | þrāiiō | vispe
      Old Church Slavonic | umĭretŭ         | trĭje  | vĭsi
      Latin               | moritur         | trēs   | omnēs
      Oscan               | ?               | trís   | súllus
      Cognacy classes (traits) for the meaning he dies: 1 {stierfþ, stirbit}; 2 {touwit}; 3 {miriiete, umĭretŭ, moritur}.
      Cognacy classes for the meaning three: 1 {þrīe, drī, þrāiiō, trĭje, trēs, trís}.
      Cognacy classes for the meaning all: 1 {ealle, alle}; 2 {vispe, vĭsi}; 3 {omnēs}; 4 {súllus}.
      Resulting binary data, one column per cognacy class (he dies 1-3 | three 1 | all 1-4):
      Old English         | 1 0 0 | 1 | 1 0 0 0
      Old High German     | 1 1 0 | 1 | 1 0 0 0
      Avestan             | 0 0 1 | 1 | 0 1 0 0
      Old Church Slavonic | 0 0 1 | 1 | 0 1 0 0
      Latin               | 0 0 1 | 1 | 0 0 1 0
      Oscan               | ? ? ? | 1 | 0 0 0 1
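The coding of cognacy classes as binary traits can be sketched in a few lines of Python. This is an illustration only: the data structures (`CLASSES`, `MISSING`) and the function name `cognates_to_binary` are assumptions made for this example, not the format used by TraitLab. Each cognacy class becomes one column, a language gets 1 if it attests a word from that class, 0 if not, and `?` if the meaning is not recorded for that language.

```python
LANGUAGES = ["Old English", "Old High German", "Avestan",
             "Old Church Slavonic", "Latin", "Oscan"]

# For each meaning, the list of cognacy classes; each class is the set of
# languages that attest a word belonging to it (taken from the slide).
CLASSES = {
    "he dies": [{"Old English", "Old High German"},
                {"Old High German"},
                {"Avestan", "Old Church Slavonic", "Latin"}],
    "three": [set(LANGUAGES)],
    "all": [{"Old English", "Old High German"},
            {"Avestan", "Old Church Slavonic"},
            {"Latin"},
            {"Oscan"}],
}

# Meanings with no recorded word for some languages.
MISSING = {"he dies": {"Oscan"}}

def cognates_to_binary(languages, classes, missing):
    """Return {language: [1, 0 or '?', ...]} with one entry per cognacy class."""
    rows = {lang: [] for lang in languages}
    for meaning, meaning_classes in classes.items():
        unknown = missing.get(meaning, set())
        for members in meaning_classes:
            for lang in languages:
                if lang in unknown:
                    rows[lang].append("?")
                else:
                    rows[lang].append(1 if lang in members else 0)
    return rows

matrix = cognates_to_binary(LANGUAGES, CLASSES, MISSING)
for lang in LANGUAGES:
    print(f"{lang:20s}", *matrix[lang])
```

Old High German has a 1 in two columns for the meaning he dies, which is how polymorphism (several words for one meaning) appears in this coding.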
  18. 18. Observation process. Full binary data:
      Old English         | 1 0 0 1 1 0 0 0
      Old High German     | 1 1 0 1 1 0 0 0
      Avestan             | 0 0 1 1 0 1 0 0
      Old Church Slavonic | 0 0 1 1 0 1 0 0
      Latin               | 0 0 1 1 0 0 1 0
      Oscan               | ? ? ? 1 0 0 0 1
      The traits displayed in only one language (columns 2, 7 and 8) are removed, leaving:
      Old English         | 1 0 1 1 0
      Old High German     | 1 0 1 1 0
      Avestan             | 0 1 1 0 1
      Old Church Slavonic | 0 1 1 0 1
      Latin               | 0 1 1 0 0
      Oscan               | ? ? 1 0 0
  21. 21. Constraints: constraints on the tree topology; 30 constraints on the age of some nodes or ancient languages. These constraints are used to estimate the evolution rates and the age.
  22. 22. Constraints. (Figure.)
  24. 24. Model (1): birth-death process. Traits are born at rate λ. Traits die at rate µ. λ and µ are constant. (Figure: example of binary trait data for 8 languages.)
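To make the birth-death process concrete, here is a minimal Python sketch that simulates the number of traits along a single lineage, assuming births at total rate λ and an independent death rate µ for each existing trait. The function name `simulate_birth_death` and the parameter values are illustrative assumptions. The sketch follows one branch only and does not model the tree.

```python
import random

def simulate_birth_death(lam, mu, duration, n_initial=0, rng=None):
    """Number of traits alive after `duration`, starting from `n_initial` traits.

    Births occur at total rate lam (lam > 0); each trait dies at rate mu.
    """
    rng = rng or random.Random()
    n = n_initial
    t = 0.0
    while True:
        total_rate = lam + mu * n
        # Waiting time until the next event (birth or death).
        t += rng.expovariate(total_rate)
        if t > duration:
            return n
        # The event is a birth with probability lam / total_rate.
        if rng.random() < lam / total_rate:
            n += 1
        else:
            n -= 1

rng = random.Random(42)
counts = [simulate_birth_death(lam=5.0, mu=0.5, duration=20.0, rng=rng)
          for _ in range(200)]
print("average number of traits:", sum(counts) / len(counts))
print("lam / mu:", 5.0 / 0.5)
```

Over a long branch the average number of traits settles near λ/µ, the quantity that reappears in the catastrophe model as the ratio ν/κ.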
  25. 25. Model (2): catastrophic rate heterogeneity. Catastrophes occur at rate ρ. At a catastrophe, each trait dies with probability κ and Poisson(ν) traits are born. λ/µ = ν/κ: the number of traits is constant on average. (Figure: example of binary trait data for 8 languages.)
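The effect of a single catastrophe can be sketched as follows in Python. The function names (`sample_poisson`, `apply_catastrophe`) and the numerical values are assumptions chosen for illustration. The Poisson sampler uses Knuth's multiplication method so that the example needs only the standard library.

```python
import math
import random

def sample_poisson(mean, rng):
    """Draw from a Poisson distribution (Knuth's method, fine for small means)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def apply_catastrophe(n_traits, kappa, nu, rng):
    """Number of traits after a catastrophe.

    Each existing trait dies independently with probability kappa,
    then Poisson(nu) new traits are born.
    """
    survivors = sum(1 for _ in range(n_traits) if rng.random() >= kappa)
    return survivors + sample_poisson(nu, rng)

# With nu / kappa equal to the current number of traits, the expected
# number of traits is unchanged: 20 * (1 - 0.25) + 5 = 20.
rng = random.Random(0)
after = [apply_catastrophe(20, kappa=0.25, nu=5.0, rng=rng) for _ in range(2000)]
print("average number of traits after a catastrophe:", sum(after) / len(after))
```

This is why the slide's condition λ/µ = ν/κ keeps the number of traits constant on average: a catastrophe replaces traits without changing their expected count.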
  26. 26. Model (3): missing data. Observation process: each data point for language i goes missing with probability ξ_i. Some traits are not observed and are thinned out of the data. (Figure: example of binary trait data for 8 languages, with missing entries marked "?".)
  27. 27. Observation process (example with 5 languages and 9 traits). Complete data:
      0 1 0 0 1 0 1 1 0
      0 0 0 1 1 0 0 1 1
      1 1 0 1 1 1 1 1 1
      1 0 0 1 0 1 1 1 0
      0 0 1 1 1 1 0 0 1
      After some entries go missing:
      ? 1 0 0 ? 0 1 1 0
      0 0 ? ? 1 0 0 1 1
      ? 1 ? ? ? 1 ? 1 1
      1 0 0 1 0 1 1 1 0
      0 ? ? 1 1 1 0 0 1
      After removing the traits with fewer than two observed 1s (columns 1 and 3):
      1 0 ? 0 1 1 0
      0 ? 1 0 0 1 1
      1 ? ? 1 ? 1 1
      0 1 0 1 1 1 0
      ? 1 1 1 0 0 1
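The two steps of the observation process illustrated above can be sketched in Python. The function names `mask_missing` and `thin_traits` are illustrative, and the threshold of two observed 1s is an assumption read off the example matrices rather than a rule stated on the slides.

```python
import random

def mask_missing(matrix, xi, rng):
    """Replace each entry of row i by '?' independently with probability xi[i]."""
    return [["?" if rng.random() < xi[i] else value for value in row]
            for i, row in enumerate(matrix)]

def thin_traits(matrix, min_ones=2):
    """Drop the columns (traits) with fewer than `min_ones` observed 1s."""
    n_cols = len(matrix[0])
    keep = [j for j in range(n_cols)
            if sum(1 for row in matrix if row[j] == 1) >= min_ones]
    return [[row[j] for j in keep] for row in matrix]

# Data with missing entries, copied from the slide.
masked = [
    ["?", 1, 0, 0, "?", 0, 1, 1, 0],
    [0, 0, "?", "?", 1, 0, 0, 1, 1],
    ["?", 1, "?", "?", "?", 1, "?", 1, 1],
    [1, 0, 0, 1, 0, 1, 1, 1, 0],
    [0, "?", "?", 1, 1, 1, 0, 0, 1],
]
observed = thin_traits(masked)
for row in observed:
    print(*row)

# Simulating the masking step on complete data, with one (hypothetical)
# missing-data probability per language:
complete = [[0, 1, 0, 0, 1, 0, 1, 1, 0],
            [0, 0, 0, 1, 1, 0, 0, 1, 1]]
print(mask_missing(complete, xi=[0.2, 0.4], rng=random.Random(0)))
```

Applied to the slide's masked matrix, `thin_traits` drops the first and third columns and keeps the other seven.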
  33. 33. TraitLab software: Bayesian inference; Markov chain Monte Carlo; (almost) uniform prior over the age of the root.
  34. 34. Why be Bayesian? In the settings described in this talk, it usually makes sense to use Bayesian inference, because: the models are complex; estimating uncertainty is paramount; the output of one model is used as the input of another; we are interested in complex functions of our parameters.
  36. 36. Frequentist statistics. Statistical inference deals with estimating an unknown parameter θ given some data D. In the frequentist view of statistics, θ has a true fixed (deterministic) value. Uncertainty is measured by confidence intervals, which are not intuitive to interpret: if I get a 95% CI of [80 ; 120] (i.e. 100 ± 20) for θ, I cannot say that there is a 95% probability that θ belongs to the interval [80 ; 120]. Frequentist statistics often use the maximum likelihood estimator: for which value of θ would the data be most likely (under our model)? $L(\theta|D) = P[D|\theta]$ and $\hat{\theta} = \arg\max_{\theta} L(\theta|D)$.
  37. 37. Bayesian statistics. In the Bayesian framework, the parameter θ is seen as inherently random: it has a distribution. Before I see any data, I have a prior distribution π(θ) on θ, usually uninformative. Once I take the data into account, I get a posterior distribution, which is hopefully more informative: $\pi(\theta|D) \propto \pi(\theta) L(\theta|D)$. Different people have different priors, hence different posteriors. But with enough data, the choice of prior matters little. We are now allowed to make probability statements about θ, such as "there is a 95% probability that θ belongs to the interval [78 ; 119]" (credible interval).
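The relation "posterior ∝ prior × likelihood" and the notion of a credible interval can be illustrated on a toy problem. The Python sketch below is unrelated to the phylogenetic model: it assumes a binomial likelihood with an unknown proportion θ, a uniform prior, and a grid approximation. All names and numbers are illustrative.

```python
def grid_posterior(successes, trials, grid_size=1001):
    """Posterior of a binomial proportion theta under a uniform prior, on a grid."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0] * grid_size  # uniform prior pi(theta)
    likelihood = [t ** successes * (1.0 - t) ** (trials - successes)
                  for t in thetas]
    unnormalised = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalised)
    return thetas, [u / total for u in unnormalised]

def credible_interval(thetas, weights, level=0.95):
    """Equal-tailed credible interval read off the cumulative grid weights."""
    tail = (1.0 - level) / 2.0
    cumulative = 0.0
    lower = upper = None
    for theta, weight in zip(thetas, weights):
        cumulative += weight
        if lower is None and cumulative >= tail:
            lower = theta
        if upper is None and cumulative >= 1.0 - tail:
            upper = theta
    return lower, upper

thetas, weights = grid_posterior(successes=7, trials=10)
low, high = credible_interval(thetas, weights)
print(f"95% credible interval for theta: [{low:.3f} ; {high:.3f}]")
```

Unlike a confidence interval, the interval printed here supports the direct statement that θ lies inside it with 95% posterior probability, given the model and the prior.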
  38. 38. Advantages and drawbacks of Bayesian statistics. Advantages: more intuitive interpretation of the results; easier to think about uncertainty; in a hierarchical setting, it becomes easier to take into account all the sources of variability. Drawbacks: prior specification (need to check that changing your prior does not change your result); computationally intensive.
  39. 39. Prior and inference
      Parameter                 | Prior      | Note on prior                                         | Method
      Tree g                    | f_G        | marginally uniform on root age, uniform on topologies | MCMC
      Death rate µ              | 1/µ        | improper; invariant by scale change                   | MCMC
      Birth rate λ              | 1/λ        | improper; invariant by scale change                   | integration
      Birth time Z              | PPP        | Poisson process + observation process                 | integration (pruning)
      Catastrophe time k        | PPP        | total per edge                                        | MCMC
      Catastrophe rate ρ        | f_R, Γ     | 95% interval: 1/tree – 1/edge                         | MCMC
      Catastrophe death rate κ  | U(0, 1)    |                                                       | MCMC
      Missing data rate ξ       | U(0, 1)^L  |                                                       | MCMC
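  40. 40. Posterior distribution
      \[
      \begin{aligned}
      p(g, \mu, \lambda, \kappa, \rho, \xi \mid D)
      &= \frac{1}{N!} \left(\frac{\lambda}{\mu}\right)^{N} \exp\left( -\frac{\lambda}{\mu} \sum_{\langle i,j \rangle \in E} P[E_Z \mid Z = (t_i, i), g, \mu, \kappa, \xi] \left(1 - e^{-\mu (t_j - t_i + k_i T_C)}\right) \right) \\
      &\quad \times \prod_{a=1}^{N} \left( \sum_{\langle i,j \rangle \in E_a} \sum_{\omega \in \Omega_a} P[M = \omega \mid Z = (t_i, i), g, \mu] \left(1 - e^{-\mu (t_j - t_i + k_i T_C)}\right) \right) \\
      &\quad \times \frac{1}{\mu\lambda} \, p(\rho) \, f_G(g \mid T) \, \frac{e^{-\rho |g|} (\rho |g|)^{k_T}}{k_T!} \prod_{i=1}^{L} (1 - \xi_i)^{Q_i} \xi_i^{N - Q_i}
      \end{aligned}
      \]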
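  41. 41. Likelihood calculation (the conditions of the second and third cases are cut off on the slide)
      \[
      \sum_{\omega \in \Omega_a^{(c)}} P[M = \omega \mid Z = (t_i, c), g, \mu] =
      \begin{cases}
      \delta_{i,c} \times \sum_{\omega \in \Omega_a^{(c)}} P[M = \omega \mid Z = (t_c, c), g, \mu] & \text{if } Y(\Omega_a^{(c)}) \ge 1 \\
      (1 - \delta_{i,c}) + \delta_{i,c} \times \sum_{\omega \in \Omega_a^{(c)}} P[M = \omega \mid Z = (t_c, c), g, \mu] & \text{if } Y(\Omega_a^{(c)}) = 0 \text{ and } Q(\Omega_a^{(c)}) \dots \\
      (1 - \delta_{i,c}) + \delta_{i,c} \, v_c^{(0)} & \text{if } Y(\Omega_a^{(c)}) + Q(\Omega_a^{(c)}) = \dots \text{ (i.e. } \Omega_a^{(c)} = \{\emptyset\})
      \end{cases}
      \]
      \[
      \sum_{\omega \in \Omega_a^{(c)}} P[M = \omega \mid Z = (t_c, c), g, \mu] =
      \begin{cases}
      1 & \text{if } \Omega_a^{(c)} = \{\{c\}, \emptyset\} \text{ or } \{\{c\}\} \text{ (i.e. } D_{c,a} \in \{?, 1\}) \\
      0 & \text{if } \Omega_a^{(c)} = \{\emptyset\} \text{ (i.e. } D_{c,a} = 0)
      \end{cases}
      \]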
  42. 42. MCMC: fit the model to the data; find trees that make the data likely; obtain a sample of trees and dates; samples are weighted by quality of fit to data.
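The idea of MCMC, drawing samples in proportion to how well they fit, can be sketched with a random-walk Metropolis sampler for a single real parameter. This Python toy is not TraitLab's sampler, which moves over trees, dates and rates. The function name `metropolis` and the standard normal target are assumptions chosen so the result is easy to check.

```python
import math
import random

def metropolis(log_target, initial, step, n_samples, rng):
    """Random-walk Metropolis sampler for a one-dimensional target.

    log_target: log of the (unnormalised) target density
    step: standard deviation of the Gaussian proposal
    """
    samples = []
    current = initial
    current_lp = log_target(current)
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_target(proposal)
        log_ratio = proposal_lp - current_lp
        # Accept with probability min(1, target(proposal) / target(current)).
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples

def log_standard_normal(x):
    """Log density of N(0, 1), up to an additive constant."""
    return -0.5 * x * x

rng = random.Random(0)
chain = metropolis(log_standard_normal, initial=3.0, step=1.0,
                   n_samples=20000, rng=rng)
kept = chain[1000:]  # discard the start of the chain as burn-in
mean = sum(kept) / len(kept)
print(f"sample mean after burn-in: {mean:.3f}")
```

Values with a higher target density are visited more often, which is the sense in which the sampled trees and dates are "weighted by quality of fit to data".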
  44. 44. Tests on synthetic data. Figure: true tree, 40 words/language. Figure: consensus tree.
  45. 45. Tests on synthetic data (2). Figure: death rate (µ).
  47. 47. Initial model: no catastrophes. Traits are born at rate λ. Traits die at rate µ. λ and µ are constant. (Figure: example of binary trait data for 8 languages.)
  48. 48. Mis-specification: catastrophic heterogeneity. (Figure with panels (a), (b), (c), (d).)
  49. 49. Influence of borrowing (1). Figure: true tree, 40 words/language, 10% borrowing. Figure: consensus tree.
  50. 50. Influence of borrowing (2). Figure: true tree, 40 words/language, 50% borrowing. Figure: consensus tree.
  51. 51. Influence of borrowing (3). The topology is reconstructed well. Dates are under-estimated. Figure: root age. Figure: death rate (µ).
  52. 52. Presence of borrowing? (Figure: curves labelled Ringe 100, b=0, b=0.1, b=0.5 and b=1; horizontal axis from 2 to 24, vertical axis from 0.4 to 1.)
  53. 53. Mis-specifications
      Mis-specification                               | Check
      Heterogeneity between traits                    | Analyse subset of data + simulated data
      Heterogeneity in time/space (non catastrophic)  | Simulated data analysis with edge rate from a Γ distribution
      Borrowing                                       | Simulated data analysis + check level of borrowing
      Data missing in blocks                          | Simulated data analysis
      Non-empty meaning categories                    | Simulated data analysis
  55. 55. Data: Indo-European languages. Core vocabulary (Swadesh 100 or 207). Two (almost) independent data sets: Dyen et al. (1997), 87 languages, mostly modern; Ringe et al. (2002), 24 languages, mostly ancient.
  56. 56. Cross-validation. Predict the age of nodes for which we have a constraint: would we reject the truth? Γ: space of trees which respect all constraints. Γ^{-c}: space obtained by removing constraint c = 1, ..., 30. Models: M_0: g ∈ Γ; M_1: g ∈ Γ^{-c}. Bayes factor: $B^{(c)} = \frac{P[g \in \Gamma \mid D, g \in \Gamma^{-c}]}{P[g \in \Gamma \mid g \in \Gamma^{-c}]}$. Constraint c conflicts with the model if $2 \log B^{(c)} < -5$.
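One way to approximate this Bayes factor from simulation output is sketched below in Python. The sketch assumes that constraint c is an age interval for one node, and that we have samples of that node's age drawn with constraint c removed, once from the posterior and once from the prior. The function name `two_log_bayes_factor` and all the numbers are made up for illustration. This is not necessarily how the authors computed it.

```python
import math

def two_log_bayes_factor(posterior_ages, prior_ages, lower, upper):
    """Estimate 2 log B for an age constraint [lower, upper] on one node.

    posterior_ages, prior_ages: sampled ages of the node, drawn with the
    constraint removed (i.e. on the space written Gamma^{-c} on the slide).
    """
    def fraction_inside(ages):
        return sum(1 for age in ages if lower <= age <= upper) / len(ages)

    posterior_fraction = fraction_inside(posterior_ages)
    prior_fraction = fraction_inside(prior_ages)
    if posterior_fraction == 0.0 or prior_fraction == 0.0:
        raise ValueError("no sample satisfies the constraint; "
                         "the ratio cannot be estimated from these samples")
    return 2.0 * math.log(posterior_fraction / prior_fraction)

# Made-up samples: the posterior puts half of its mass inside the
# constraint [2000, 2500], the prior only a quarter.
posterior_ages = [2100, 2300, 2400, 2450, 2600, 2700, 2800, 3000]
prior_ages = [500, 1500, 2200, 2400, 3500, 4500, 5500, 6500]
score = two_log_bayes_factor(posterior_ages, prior_ages, 2000, 2500)
print(f"2 log B = {score:.2f}; conflict: {score < -5}")
```

A strongly negative value means the data pull the node's age away from the constraint even though the prior allows it, which is the slide's criterion for a conflict.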
  57. 57. Cross-validation. (Figure: one entry per constraint, labelled HI TA TB LU LY OI UM OS LA GK AR GO ON OE OG OS PR AV PE VE CE IT GE WG NW BS BA IR II TG; axis ticks at 100, 10, 5, 2, 0, −2, −5, −10, −100 and at 0, 2000, 4000, 6000, 8000.)
  68. 68. Consensus tree: modern languages (Dyen data). (Figure.)
  69. 69. Consensus tree: ancient languages (Ringe data). (Figure: tree with leaves Old High German, Old English, Old Norse, Gothic, Oscan, Umbrian, Latin, Welsh, Old Irish, Old Persian, Avestan, Vedic, Lithuanian, Latvian, Old Prussian, Old Church Slavonic, Greek, Armenian, Lycian, Luvian, Hittite, Tocharian B, Tocharian A, Albanian; node labels 66, 85, 58, 78, 62; time axis from 8000 to 0.)
  70. 70. Root age. (Figure.)
  71. 71. Conclusions. Strong support for the Anatolian farming hypothesis: root around 8000 BP. Statistics reconstruct known linguistic facts and answer unresolved questions. TraitLab: it’s free! (Though Matlab is not...)
  73. 73. Semitic lexical data. Data: Kitchen et al. (2009); 25 languages, 96 meanings, 674 cognacy classes. Questions of interest: root age (constraint known), topology, outgroup.
  74. 74. Model validation. (Figure.) Thin bar: constraint. Thick bar: 95% posterior HPD. (Red bar: 95% prior HPD.)
  75. 75. Model validation. (Figure.)
  76. 76. Conclusions. Root age 95% HPD: 4400 – 5100 BP. Akkadian outgroup: 67% (Syrian homeland?). Zero catastrophes: 33%.
  78. 78. Back to Bergsland and Vogt. Norse family, 8 languages. Selection bias. Claim that the rate of change is significantly different for these data. B&V included words used only in literary Icelandic, which we exclude. We can handle polymorphism. We do not include catastrophes.
  79. 79. Known history. (Figure with labels Gjestal, Sandnes, Riksmal and Icelandic, and centuries X, XI, XII, XIII.)
  80. 80. Tests. Two possible ways to test whether the same model parameters apply to this example and to Indo-European: (1) assume parameters are the same as for the general Indo-European tree, and estimate ancestral ages; (2) use Norse constraints to estimate parameters, and compare to parameter estimates from the general Indo-European tree.
  81. 81. Results. If we use parameter values from another analysis, we can try to estimate the age of 13th century Norse. True constraint: 660 – 760 BP. Our HPD: 615 – 872 BP. If we analyse the Norse data on its own, we estimate parameters. Value of µ for Norse: (2.47 ± 0.4) × 10^{-4}. Value of µ for IE: (1.86 ± 0.39) × 10^{-4} (Dyen), (2.37 ± 0.21) × 10^{-4} (Ringe).
  82. 82. But... We can also try to estimate the age of Icelandic (which is 0 BP). We find 439 – 560 BP, far from the true value. B&V were right: there was significantly less change on the branch leading to Icelandic than average. However, we are still able to estimate internal node ages.
  84. 84. Georgian. Second data set: Georgian and Mingrelian. Age of ancestor: last millennium BC. Code data given by B&V, discarding borrowed items. Use rate estimate from the Ringe et al. analysis. 95% HPD: 2065 – 3170 BP.
  85. 85. B&V: conclusions. Third data set (Armenian) not clear enough to be recoded. There is variation in the number of changes on an edge. Nonetheless, we are still able to estimate ancestral language age. Variation in borrowing rates. B&V: "we cannot estimate dates, and it follows that we cannot estimate the topology either". We can estimate dates, and even if we couldn’t, we might still be able to estimate the topology.
  87. 87. Atkinson et al. (2008). Hypothesis: when a language is founded by a migration, the founder effect leads to fast change over a short period of time. There is a catastrophe at each branching event. Indirect estimation: correlation between number of changes between root and leaf, and number of branching events along the same path. Atkinson: 21% of changes in the history of IE are due to punctuational bursts.
  88. 88. Atkinson et al. (2008). (Figure.)
  89. 89. Direct analysis. We force a catastrophe on each edge and infer the size of catastrophes. We find κ very close to 0: less than 1% of change can be attributed to punctuational bursts. Reason for discrepancy unclear.
  90. 90. Conclusions. Strong support for age of PIE around 8000 BP. Statistical methods can help answer questions which traditional methods cannot. Many more questions and models to come. TraitLab: it’s free! (Although Matlab is not...)
  91. 91. Questions: otázky, kesses, spørgsmåler, cwestiwnau, pytania, preguntes, preguntas, vrae, kláusimai, Fragen, voprosy, quaestiones, întrebări, questions, vragen, ερωτήσεις, zapitanni, spurningar, domande, spørsmåler, questões, frågor, vprašanja
  92. 92. References
      R. J. Ryder & G. K. Nicholls (2011), Missing data in a stochastic Dollo model for cognate data, and its application to the dating of Proto-Indo-European, JRSS C.
      G. K. Nicholls (2008), Horses or farmers? The tower of Babel and confidence in trees, Significance (popular science).
      G. K. Nicholls & R. J. Ryder (2011), Phylogenetic models for Semitic vocabulary, IWSM.
      R. J. Ryder (2010), Phylogenetic Models of Language Diversification, DPhil thesis, University of Oxford.
