Statistical Methods in Historical Linguistics


  1. 1. Statistical methods for Historical Linguistics. Robin J. Ryder, Centre de Recherche en Mathématiques de la Décision, Université Paris-Dauphine. 7 March 2013, UCLA. Robin J. Ryder (Paris Dauphine) Statistics and Historical Linguistics UCLA March 2013 1 / 98
  2. 2. Introduction: A large number of recent papers describe computationally intensive statistical methods for Historical Linguistics. Increased computational power. Advances in statistical methodology. Complex linguistic questions which cannot be answered with traditional methods.
  3. 3. Caveats: I am not a linguist; I am a statistician. These papers were not written by me; most figures were created by the papers’ authors. I use the word “evolution” in a broad sense. “All models are wrong, but some are useful.”
  4. 4. Aims of this talk: Review of several recent papers on statistical models for Historical Linguistics. Statisticians won’t replace linguists. When done correctly, collaborations between statisticians and linguists can provide useful results. Brief introduction to some major statistical concepts, plus statistical comments.
  5. 5. Statistics slides: Different colour. Feel free to ignore the equations. Useful (essential?) if you wish to collaborate with statisticians. Limitations of statistical tools. Statistics should not be a black box.
  6. 6. Advantages of statistical methods: Analyse (very) large datasets. Test multiple hypotheses. Cross-validation. Estimate uncertainty.
  7. 7. Statistical method in a nutshell: (1) Collect data. (2) Design model. (3) Perform inference (MCMC, ...). (4) Check convergence. (5) In-model validation (is our inference method able to answer questions from our model?). (6) Model mis-specification analysis (do we need a more complex model?). (7) Conclude. In general, it is more difficult to perform inference for a more complex model.
  8. 8. Contents: (1) Swadesh: glottochronology. (2) Gray and Atkinson: Language phylogenies. (3) Lieberman et al., Pagel et al.: Frequency of use. (4) Perreault and Mathew, Atkinson: Phoneme expansion. (5) Dunn et al.: Word-order universals. (6) Bouchard-Côté et al.: automated cognates. (7) Overall conclusions. Tomorrow: phylogenetics for core vocabulary.
  9. 9. Outline: (1) Swadesh: glottochronology. (2) Gray and Atkinson: Language phylogenies. (3) Lieberman et al., Pagel et al.: Frequency of use. (4) Perreault and Mathew, Atkinson: Phoneme expansion. (5) Dunn et al.: Word-order universals. (6) Bouchard-Côté et al.: automated cognates. (7) Overall conclusions.
  10. 10. Swadesh (1952): “Lexico-Statistic Dating of Prehistoric Ethnic Contacts, With Special Reference to North American Indians and Eskimos”, Morris Swadesh. First column: “Prehistory refers to the long period of early human society before writing was available for the recording of events. In a few places it gives way to the modern epoch of recorded history as much as six or eight thousand years ago; in many areas this happened only in the last few centuries. Everywhere prehistory represents a great obscure depth which science seeks to penetrate. And indeed powerful means have been found for illuminating the unrecorded past, including the evidence of archeological finds and that of the geographic distribution of cultural facts in the earliest known periods. Much depends on the painstaking analysis and comparison of data, and on the effective reading of their implications. Very important is the combined use of all the evidence, linguistic and ethnographic as well as archeological, biological, and geological. And it is essential constantly to seek new means of expanding and rendering more accurate our deductions about prehistory. One of the most significant recent trends in the [...]” Second column: “[...] measuring the amount of radioactivity still going on. Consequently, it is possible to determine within certain limits of accuracy the time depth of any archeological site which contains a suitable bit of bone, wood, grass, or any other organic substance. Lexicostatistic dating makes use of very different material from carbon dating, but the broad theoretical principle is similar. Researches by the present author and several other scholars within the last few years have revealed that the fundamental everyday vocabulary of any language, as against the specialized or "cultural" vocabulary, changes at a relatively constant rate. The percentage of retained elements in a suitable test vocabulary therefore indicates the elapsed time. Wherever a speech community comes to be divided into two or more parts so that linguistic change goes separate ways in each of the new speech communities, the percentage of common retained vocabulary gives an index of the amount of time that [...]”
  11. 11. Question at hand: Aim: dating the MRCA (Most Recent Common Ancestor) of a pair of languages. Data: “core vocabulary” (Swadesh lists), 215 or 100 words.
  12. 12. all, and, animal, ashes, at, back, bad, bark, because, belly, big, bird, bite, black, blood, blow, bone, breast, breathe, burn, child, claw, cloud, cold, come, count, dog, drink, dry, dull, dust, ear, earth, eat, egg, eye, fall, far, fat, father, fear, feather, few, fight, fire, fish, five, float, flow, flower, fly, fog, good, grass, green, guts, hair, hand, he, head, hear, heart, heavy, here, hit, hold, horn, how, hunt, husband, I, ice, if, in, kill, knee, know, lake, live, liver, long, louse, man, many, meat, moon, mother, mountain, mouth, name, narrow, near, neck, new, night, nose, not, old, one, other, person, play, pull, right (side), river, road, root, rope, rotten, round, rub, salt, sand, say, scratch, sea, see, seed, sew, sharp, short, sing, sit, skin, sky, sleep, small, smell, smoke, spit, split, squeeze, stab, stand, star, stick, stone, straight, suck, sun, swell, swim, tail, ten, that, there, they, thick, thin, think, this, thou, three, throw, tie, two, vomit, walk, warm, wash, water, we, wet, what, when, where, white, who, wide, wife, wind, wing, wipe, with, woman, woods
  13. 13. Assumptions: Swadesh assumed that core vocabulary evolves at a constant rate (through time, space and meanings). Given a pair of languages with percentage $C$ of shared cognates (expressed as a proportion between 0 and 1), and a constant retention rate $r$, the age $t$ of the MRCA is $t = \frac{\log C}{2 \log r}$. The constant $r$ was estimated using a pair of languages for which the age of the MRCA is known.
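
As a worked illustration of the formula on this slide, the sketch below computes $t$ from $C$ and $r$. The function name and the input values (70% shared cognates, retention rate 0.86 per unit of time) are illustrative assumptions, not figures from the talk.

```python
import math

def glottochronology_age(shared_cognates: float, retention_rate: float) -> float:
    """Age t of the MRCA, in the time unit over which retention_rate is measured.

    Implements t = log(C) / (2 * log(r)). The factor 2 reflects that both
    languages have been changing since the split.
    """
    if not 0.0 < shared_cognates <= 1.0:
        raise ValueError("shared_cognates must be a proportion in (0, 1]")
    if not 0.0 < retention_rate < 1.0:
        raise ValueError("retention_rate must be in (0, 1)")
    return math.log(shared_cognates) / (2.0 * math.log(retention_rate))

# Hypothetical inputs, chosen only to exercise the formula.
age = glottochronology_age(shared_cognates=0.70, retention_rate=0.86)
print(f"estimated age: {age:.2f} time units")
```

Note that the function returns a single number with no measure of uncertainty, which is exactly the second shortcoming listed on the next slide.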
  14. 14. Issues with glottochronology: Many statistical shortcomings. Mainly: (1) Simplistic model. (2) No evaluation of uncertainty of estimates. (3) Only small amounts of data are used. See Bergsland and Vogt (1962) for debunking of glottochronology.
  15. 15. What has changed? More elaborate models + model misspecification analyses. We can estimate the uncertainty (⇒ easier to answer “I don’t know”). Large amounts of data. Tomorrow: new analysis of Bergsland and Vogt’s data.
  16. 16. Outline: (1) Swadesh: glottochronology. (2) Gray and Atkinson: Language phylogenies. (3) Lieberman et al., Pagel et al.: Frequency of use. (4) Perreault and Mathew, Atkinson: Phoneme expansion. (5) Dunn et al.: Word-order universals. (6) Bouchard-Côté et al.: automated cognates. (7) Overall conclusions.
  17. 17. Gray and Atkinson: “Language-tree divergence times support the Anatolian theory of Indo-European origin”, Russell D. Gray & Quentin D. Atkinson, Department of Psychology, University of Auckland, Private Bag 92019, Auckland 1020, New Zealand. Nature, vol. 426, 27 November 2003, p. 435. Abstract: “Languages, like genes, provide vital clues about human history. The origin of the Indo-European language family is "the most intensively studied, yet still most recalcitrant, problem of historical linguistics". Numerous genetic studies of Indo-European origins have also produced inconclusive results. Here we analyse linguistic data using computational methods derived from evolutionary biology. We test two theories of Indo-European origin: the ‘Kurgan expansion’ and the ‘Anatolian farming’ hypotheses. The Kurgan theory centres on possible archaeological evidence for an expansion into Europe and the Near East by Kurgan horsemen beginning in the sixth millennium BP. In contrast, the Anatolian theory claims that Indo-European languages expanded with the spread of agriculture from Anatolia around 8,000–9,500 years BP. In striking agreement with the Anatolian hypothesis, our analysis of a matrix of 87 languages with 2,449 lexical items produced an estimated age range for the initial Indo-European divergence of between 7,800 and 9,800 years BP. These results were robust to changes in coding procedures, calibration points, rooting of the trees and priors in the bayesian analysis.”
  18. 18. Swadesh lists, better analysed: Use Swadesh lists for 87 Indo-European languages, and a phylogenetic model from Genetics. Assume a tree-like model of evolution. Bayesian inference via MCMC (Markov Chain Monte Carlo). Reconstruct trees and dates. Main parameter of interest: age of the root (Proto-Indo-European, PIE).
  19. 19. Lexical trees
  20. 20. Correcting the issues with glottochronology: Returning to the issues with glottochronology: (1) Simplistic model → slightly better, but the model of evolution is rudimentary. (2) No evaluation of uncertainty of estimates → Bayesian inference. (3) Only small amounts of data are used → large number of languages reduces variability of estimates.
  21. 21. Why be Bayesian? In the settings described in this talk, it usually makes sense to use Bayesian inference, because: the models are complex; estimating uncertainty is paramount; the output of one model is used as the input of another; we are interested in complex functions of our parameters.
  22. 22. Frequentist statistics: Statistical inference deals with estimating an unknown parameter θ given some data D. In the frequentist view of statistics, θ has a true fixed (deterministic) value. Uncertainty is measured by confidence intervals, which are not intuitive to interpret: if I get a 95% CI of [80 ; 120] (i.e. 100 ± 20) for θ, I cannot say that there is a 95% probability that θ belongs to the interval [80 ; 120] (the 95% describes how often intervals built this way cover θ over repeated experiments).
  23. 23. Frequentist statistics (continued): Frequentist statistics often use the maximum likelihood estimator: for which value of θ would the data be most likely (under our model)? $L(\theta \mid D) = P[D \mid \theta]$, $\hat{\theta} = \arg\max_{\theta} L(\theta \mid D)$.
  24. 24. Bayesian statistics: In the Bayesian framework, the parameter θ is seen as inherently random: it has a distribution. Before I see any data, I have a prior distribution π(θ) on θ, usually uninformative. Once I take the data into account, I get a posterior distribution, which is hopefully more informative: $\pi(\theta \mid D) \propto \pi(\theta)\, L(\theta \mid D)$. Different people have different priors, hence different posteriors. But with enough data, the choice of prior matters little. We are now allowed to make probability statements about θ, such as “there is a 95% probability that θ belongs to the interval [78 ; 119]” (credible interval).
  25. 25. Advantages and drawbacks of Bayesian statistics: More intuitive interpretation of the results. Easier to think about uncertainty. In a hierarchical setting, it becomes easier to take into account all the sources of variability. Prior specification: need to check that changing your prior does not change your result. Computationally intensive.
  26. 26. Effect of prior: Example: Bernoulli model (biased coin). θ = probability of success. Observe y = 72 successes out of n = 100 trials. Frequentist estimate: $\hat{\theta} = 0.72$. Confidence interval: [0.63 ; 0.81]. Bayesian estimate: will depend on the prior.
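
The frequentist numbers on this slide can be reproduced with a short sketch. It assumes the quoted interval is the usual 95% normal-approximation (Wald) interval; the slide does not name the method, but this choice does give [0.63 ; 0.81]. The function name is hypothetical.

```python
import math

def bernoulli_mle_with_wald_ci(successes: int, trials: int, z: float = 1.96):
    """Maximum likelihood estimate of theta and a normal-approximation CI.

    For the Bernoulli model the likelihood theta^y * (1 - theta)^(n - y)
    is maximised at theta_hat = y / n.
    """
    theta_hat = successes / trials
    standard_error = math.sqrt(theta_hat * (1.0 - theta_hat) / trials)
    return theta_hat, (theta_hat - z * standard_error, theta_hat + z * standard_error)

theta_hat, (low, high) = bernoulli_mle_with_wald_ci(successes=72, trials=100)
print(f"theta_hat = {theta_hat:.2f}, CI = [{low:.2f} ; {high:.2f}]")
```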
  27. 27. Effect of prior: [Figure: three panels, each plotting prior(x) against x from 0 to 1.] y = 72, n = 100. Black: prior; green: likelihood; red: posterior.
  28. 28. Effect of prior: [Figure: three panels, each plotting prior(x) against x from 0 to 1.] y = 7, n = 10. Black: prior; green: likelihood; red: posterior.
  29. 29. Effect of prior: [Figure: three panels, each plotting prior(x) against x from 0 to 1.] y = 721, n = 1000. Black: prior; green: likelihood; red: posterior.
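
The three figures above can be mimicked numerically. The sketch below is an assumption-laden illustration: the priors plotted on the slides are not recoverable from the transcript, so three Beta priors are chosen arbitrarily, and the conjugate Beta-Binomial update is used because it gives the posterior in closed form.

```python
# Hypothetical priors; the ones used on the slides are not known.
PRIORS = {
    "uniform Beta(1, 1)": (1, 1),
    "centred on 0.5, Beta(10, 10)": (10, 10),
    "favouring small theta, Beta(2, 8)": (2, 8),
}

def beta_posterior(a: float, b: float, successes: int, trials: int):
    """Beta(a, b) prior with a Binomial likelihood gives Beta(a + y, b + n - y)."""
    return a + successes, b + trials - successes

def posterior_means(successes: int, trials: int) -> dict:
    """Posterior mean of theta under each prior in PRIORS."""
    means = {}
    for name, (a, b) in PRIORS.items():
        post_a, post_b = beta_posterior(a, b, successes, trials)
        means[name] = post_a / (post_a + post_b)
    return means

# The three datasets shown on the slides.
for y, n in [(7, 10), (72, 100), (721, 1000)]:
    means = posterior_means(y, n)
    spread = max(means.values()) - min(means.values())
    print(f"y = {y}, n = {n}: spread of posterior means across priors = {spread:.3f}")
```

The spread across priors shrinks as n grows, which is the point made earlier: with enough data, the choice of prior matters little.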
  30. 30. Bayesian inference for lexical trees: The tree parameter is seen as random: it has a distribution. Via MCMC, G & A get a sample of possible trees, with associated probabilities, rather than a single tree. The uncertainty in trees is thus made explicit.
  31. 31. G & A: Assumptions: Tree-like evolution. Genetics model. Independence of cognates. Constant rate of change.
  32. 32. G & A: conclusions: Age of PIE: 7800-9800 BP (Before Present). Large error bars, but this is a good thing. Reconstruct many known features of the tree of Indo-European languages. Little validation of the model, no model misspecification analysis. Much more on these data and more elaborate models and analyses tomorrow. These trees can also be used as a building block to answer other questions.
  33. 33. Outline: (1) Swadesh: glottochronology. (2) Gray and Atkinson: Language phylogenies. (3) Lieberman et al., Pagel et al.: Frequency of use. (4) Perreault and Mathew, Atkinson: Phoneme expansion. (5) Dunn et al.: Word-order universals. (6) Bouchard-Côté et al.: automated cognates. (7) Overall conclusions.
  34. 34. Lieberman et al. (2007): “Quantifying the evolutionary dynamics of language”, Erez Lieberman, Jean-Baptiste Michel, Joe Jackson, Tina Tang & Martin A. Nowak. Nature, vol. 449, 11 October 2007, doi:10.1038/nature06137. Abstract: “Human language is based on grammatical rules. Cultural evolution allows these rules to change over time. Rules compete with each other: as new rules rise to prominence, old ones die away. To quantify the dynamics of language evolution, we studied the regularization of English verbs over the past 1,200 years. Although an elaborate system of productive conjugations existed in English’s proto-Germanic ancestor, Modern English uses the dental suffix, ‘-ed’, to signify past tense. Here we describe the emergence of this linguistic rule amidst the evolutionary decay of its exceptions, known to us as irregular verbs. We have generated a data set of verbs whose conjugations have been evolving for more than a millennium, tracking inflectional changes to 177 Old-English irregular verbs. Of these irregular verbs, 145 remained irregular in Middle English and 98 are still irregular today. We study how the rate of regularization depends on the frequency of word usage. The half-life of an irregular verb scales as the square root of its usage frequency: a verb that is 100 times less frequent regularizes 10 times as fast. Our study provides a quantitative analysis of the regularization process by which ancestral forms gradually yield to an emerging linguistic rule.” Excerpt from the main text: “[...] applied to three-quarters of all Old English verbs. The exceptions (ancestors of the modern irregular verbs) were mostly members of the so-called ‘strong’ verbs. There are seven different classes of strong verbs with exemplars among the Modern English irregular verbs, each with distinguishing markers that often include characteristic vowel shifts. Although stable coexistence of multiple rules is one possible outcome of rule dynamics, this is not what occurred in English verb inflection. We therefore define regularity with respect to the modern ‘-ed’ rule, and call all these exceptional forms ‘irregular’. We consulted a large collection of grammar textbooks describing verb inflection in these earlier epochs, and hand-annotated every irregular verb they described (see Supplementary Information). This provided us with a list of irregular verbs from ancestral forms of English. By eliminating verbs that were no longer part of Modern English, we compiled a list of 177 Old English irregular verbs that remain part of the language to this day. Of these 177 Old English irregulars, 145 remained irregular in Middle English and 98 are still irregular in Modern English. Verbs such as ‘help’, ‘grip’ and ‘laugh’, which were once irregular, have become regular with the passing of time.”
  35. 35. Question at hand: We wish to verify that the rate of change tends to be slower for words which are used more frequently.
  36. 36. Data: Regularization of irregular verbs. Old English, Middle English and Modern English (1200 years of evolution). 177 irregular verbs, of which 79 regularize. Frequency of use.
  37. 37. Results
  38. 38. Conclusions: Regularization rate scales as the inverse square root of usage frequency (equivalently, the half-life of an irregular verb scales as the square root of its usage frequency). A verb used 4 times more often regularizes at half the rate. Evolutionary history known. Sub-optimal: binning.
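
The square-root law can be checked with two lines of arithmetic. The sketch below assumes the relationship is exactly rate ∝ frequency^(-1/2), which is the idealised form of the reported result; the function name is hypothetical.

```python
def relative_regularization_rate(frequency_ratio: float) -> float:
    """Rate of verb A relative to verb B, when A is used frequency_ratio times as often.

    Assumes rate is proportional to frequency ** -0.5, i.e. half-life
    proportional to the square root of frequency.
    """
    if frequency_ratio <= 0:
        raise ValueError("frequency_ratio must be positive")
    return frequency_ratio ** -0.5

# A verb used 4 times more often regularizes at half the rate.
print(relative_regularization_rate(4.0))
# A verb that is 100 times less frequent regularizes about 10 times as fast.
print(relative_regularization_rate(1.0 / 100.0))
```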
  39. 39. Pagel et al.: “Frequency of word-use predicts rates of lexical evolution throughout Indo-European history”, Mark Pagel, Quentin D. Atkinson & Andrew Meade. Nature, vol. 449, 11 October 2007, doi:10.1038/nature06176. Abstract: “Greek speakers say "oura", Germans "schwanz" and the French "queue" to describe what English speakers call a ‘tail’, but all of these languages use a related form of ‘two’ to describe the number after one. Among more than 100 Indo-European languages and dialects, the words for some meanings (such as ‘tail’) evolve rapidly, being expressed across languages by dozens of unrelated words, while others evolve much more slowly, such as the number ‘two’, for which all Indo-European language speakers use the same related word-form. No general linguistic mechanism has been advanced to explain this striking variation in rates of lexical replacement among meanings. Here we use four large and divergent language corpora (English, Spanish, Russian and Greek) and a comparative database of 200 fundamental vocabulary meanings in 87 Indo-European languages to show that the frequency with which these words are used in modern language predicts their rate of replacement over thousands of years of Indo-European language evolution. Across all 200 meanings, frequently used words evolve at slower rates and infrequently used words evolve more rapidly. This relationship holds separately and identically [...]” Excerpt from the main text: “[...] paired meanings in the Bantu languages. This indicates that variation in the rates of lexical replacement among meanings is not merely an historical accident, but rather is linked to some general process of language evolution. Social and demographic factors proposed to affect rates of language change within populations of speakers include social status, the strength of social ties, the size of the population and levels of outside contact. These forces may influence rates of evolution on a local and temporally specific scale, but they do not make general predictions across language families about differences in the rate of lexical replacement among meanings. Drawing on concepts from theories of molecular and cultural evolution, we suggest that the frequency with which different meanings are used in everyday language may affect the rate at which new words arise and become adopted in populations of speakers. If frequency of meaning-use is a shared and stable feature of human languages, then this could provide a general mechanism to explain the large differences across meanings in observed rates of lexical replacement. Here we test this idea by examining the relationship between the rates at which Indo- [...]”
  40. 40. Question at hand: Check link between frequency of use and rate of change for vocabulary. Hypothesis: when a meaning is used more often, the corresponding word is less likely to change. Problem: since this rate is expected to be much slower than that of strong verb regularization, we need to look at much deeper history. But then the evolutionary history is unknown.
  41. 41. Workaround: Use Indo-European core vocabulary data, and frequencies from English, Greek, Russian and Spanish. Get a sample from the distribution on trees and ancestral ages using G&A’s method. For each tree in the sample, estimate the rate of change for each meaning. Average across all trees.
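
The last two steps of the workaround (estimate per tree, then average) amount to a Monte Carlo average over the tree sample. The sketch below is a toy illustration under strong assumptions: each sampled tree is reduced to its total branch length, the per-tree rate estimator is simply "number of replacements divided by total branch length", and all numbers are invented. It is not the estimator used by Pagel et al.

```python
from statistics import mean, stdev

def rate_given_tree(n_replacements: int, total_branch_length: float) -> float:
    """Toy per-tree estimate: replacements per unit of branch length (hypothetical)."""
    return n_replacements / total_branch_length

def average_over_trees(n_replacements: int, sampled_branch_lengths: list):
    """Average the per-tree estimates over a sample of trees.

    Returns the mean rate and the standard deviation across trees, so the
    uncertainty coming from the phylogeny stays visible in the result.
    """
    rates = [rate_given_tree(n_replacements, length) for length in sampled_branch_lengths]
    return mean(rates), stdev(rates)

# Invented total branch lengths (in millennia) for five sampled trees.
tree_lengths = [48.0, 52.5, 50.0, 55.0, 46.5]
avg_rate, spread = average_over_trees(n_replacements=5, sampled_branch_lengths=tree_lengths)
print(f"rate averaged over trees: {avg_rate:.4f} (sd across trees: {spread:.4f})")
```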
  42. 42. Results
  43. 43. Comments: Even if there is high uncertainty in the phylogenies, we can still answer other questions (integrating out the tree). Similar results for Bantu (Pagel and Meade 2006).
  44. 44. Sampling from a distribution: Given a parameter θ with probability density π(θ), we often wish to estimate the expected value E[θ], or the probability mass of an interval P[a < θ < b]. If we can sample n independent and identically distributed (iid) values θ_1, ..., θ_n from π(·), then (under general assumptions) we have a good estimate of E[θ]: $\frac{1}{n} \sum_{i=1}^{n} \theta_i \xrightarrow[n \to \infty]{} E[\theta]$ (and the same holds for $I_h = \int h(\theta)\, \pi(\theta)\, d\theta$, estimated by $\frac{1}{n} \sum_{i=1}^{n} h(\theta_i)$, for any measurable function $h$). Convergence is fast. But in general, getting an iid sample is difficult or impossible.
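
A minimal sketch of the iid case, assuming for illustration that π is the Beta(73, 29) distribution (the posterior for the earlier coin example under a uniform prior, a case where iid sampling happens to be easy). The helper name, sample size and seed are arbitrary choices.

```python
import random

def monte_carlo_mean(sampler, n: int, h=lambda theta: theta) -> float:
    """Estimate the expectation of h(theta) by averaging over n iid draws."""
    return sum(h(sampler()) for _ in range(n)) / n

rng = random.Random(0)                      # fixed seed for reproducibility
draw = lambda: rng.betavariate(73, 29)      # iid draws from Beta(73, 29)

# E[theta]; the exact value is 73 / (73 + 29).
estimate_mean = monte_carlo_mean(draw, 100_000)

# P[0.63 < theta < 0.81], using an indicator function for h.
estimate_prob = monte_carlo_mean(
    draw, 100_000, h=lambda theta: 1.0 if 0.63 < theta < 0.81 else 0.0
)
print(f"E[theta] ~ {estimate_mean:.4f}, P[0.63 < theta < 0.81] ~ {estimate_prob:.3f}")
```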
  45. 45. MCMC: If we are able to compute the value of the density π(θ) at any point θ (up to a normalizing constant), we can use MCMC (Markov Chain Monte Carlo) to generate correlated θ_1, ..., θ_n from π(·). The convergence still holds: $\frac{1}{n} \sum_{i=1}^{n} \theta_i \xrightarrow[n \to \infty]{} E[\theta]$, but is much slower. Since we use a finite value of n, we need to check the convergence of the algorithm. In general, this is done by: (1) checking the trace of the algorithm; (2) checking the autocorrelations; (3) running the algorithm for a very large value of n. Curse of dimensionality: when the dimensionality of the parameter space increases, convergence slows dramatically. More during the TraitLab tutorial.
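
A minimal random-walk Metropolis sketch for the coin example (y = 72, n = 100, uniform prior), offered as an illustration of the idea rather than as the algorithm used in any of the papers or in TraitLab. The step size, chain length, burn-in and seed are arbitrary assumptions.

```python
import math
import random

def log_posterior(theta: float, y: int = 72, n: int = 100) -> float:
    """Unnormalised log posterior for a Bernoulli model with a uniform prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return y * math.log(theta) + (n - y) * math.log(1.0 - theta)

def metropolis(log_density, start: float, n_iter: int, step: float, rng) -> list:
    """Random-walk Metropolis: propose theta + Normal(0, step), accept or stay."""
    chain = [start]
    current, current_lp = start, log_density(start)
    for _ in range(n_iter):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_density(proposal)
        # Accept with probability min(1, density ratio).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        chain.append(current)
    return chain

def lag1_autocorrelation(chain: list) -> float:
    """Correlation between successive values; near 1 means slow mixing."""
    m = sum(chain) / len(chain)
    numerator = sum((a - m) * (b - m) for a, b in zip(chain, chain[1:]))
    denominator = sum((a - m) ** 2 for a in chain)
    return numerator / denominator

rng = random.Random(1)
chain = metropolis(log_posterior, start=0.5, n_iter=20_000, step=0.1, rng=rng)
kept = chain[2_000:]                         # discard burn-in
posterior_mean = sum(kept) / len(kept)
rho = lag1_autocorrelation(kept)
print(f"posterior mean ~ {posterior_mean:.3f}, lag-1 autocorrelation ~ {rho:.2f}")
```

The positive autocorrelation is why the average converges more slowly than with iid draws: successive values carry less new information.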
  46. 46. ABC: If we are not able to compute π(θ), approximate inference may still be possible using a likelihood-free method such as Approximate Bayesian Computation (ABC). The idea is to draw many values of the parameters; for each value, simulate a new dataset; keep only parameter values which lead to simulated data “close” to the observed data. The theoretical properties are still mostly unknown. This method induces an approximation, and it is difficult to control the size of the approximation. But sometimes, this is the only possibility.
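
The three steps just described (draw, simulate, keep if close) can be sketched on the coin example, where the exact answer is known and can serve as a check. The uniform prior, the distance (absolute difference in the number of successes), the tolerance and the seed are all illustrative assumptions.

```python
import random

def abc_rejection(observed: int, n_trials: int, n_draws: int, tolerance: int, rng) -> list:
    """ABC rejection sampler for a Bernoulli success probability.

    The likelihood is never evaluated: we only simulate data and compare
    the simulated number of successes with the observed one.
    """
    accepted = []
    for _ in range(n_draws):
        theta = rng.random()                                  # draw from the uniform prior
        simulated = sum(rng.random() < theta for _ in range(n_trials))  # simulate a dataset
        if abs(simulated - observed) <= tolerance:            # keep if "close"
            accepted.append(theta)
    return accepted

rng = random.Random(2)
accepted = abc_rejection(observed=72, n_trials=100, n_draws=20_000, tolerance=2, rng=rng)
abc_mean = sum(accepted) / len(accepted)
print(f"kept {len(accepted)} of 20000 draws, approximate posterior mean ~ {abc_mean:.3f}")
```

A larger tolerance keeps more draws but makes the approximation coarser; this trade-off is the approximation error the slide warns is hard to control.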
  47. 47. Outline: (1) Swadesh: glottochronology. (2) Gray and Atkinson: Language phylogenies. (3) Lieberman et al., Pagel et al.: Frequency of use. (4) Perreault and Mathew, Atkinson: Phoneme expansion. (5) Dunn et al.: Word-order universals. (6) Bouchard-Côté et al.: automated cognates. (7) Overall conclusions.
  48. 48. Perreault and Mathew: “Dating the Origin of Language Using Phonemic Diversity”, Charles Perreault (Santa Fe Institute, Santa Fe, New Mexico, United States of America) and Sarah Mathew (Centre for the Study of Cultural Evolution, Stockholm University, Stockholm, Sweden). Abstract: “Language is a key adaptation of our species, yet we do not know when it evolved. Here, we use data on language phonemic diversity to estimate a minimum date for the origin of language. We take advantage of the fact that phonemic diversity evolves slowly and use it as a clock to calculate how long the oldest African languages would have to have been around in order to accumulate the number of phonemes they possess today. We use a natural experiment, the colonization of Southeast Asia and Andaman Islands, to estimate the rate at which phonemic diversity increases through time. Using this rate, we estimate that present-day languages date back to the Middle Stone Age in Africa. Our analysis is consistent with the archaeological evidence suggesting that complex human behavior evolved during the Middle Stone Age in Africa, and does not support the view that language is a recent adaptation that has sparked the dispersal of humans out of Africa. While some of our assumptions require testing and our results rely at present on a single case-study, our analysis constitutes the first estimate of when language evolved that is directly based on linguistic data.” Citation: Perreault C, Mathew S (2012) Dating the Origin of Language Using Phonemic Diversity. PLoS ONE 7(4): e35289. doi:10.1371/journal.pone.0035289. Received September 8, 2011; accepted March 14, 2012; published April 27, 2012.
  49. 49. Aim Date Proto-Human.
  50. 50. Premises Two small populations B and C disperse from the same parent A. B colonizes a large territory, C a small isolated island. B will tend to accumulate new phonemes at a faster rate than C, so $P_B > P_C$. If the island is small enough, C will have about the same phonemic diversity as the ancestral language A: $E(P_C) = P_A$. The difference in modern-day phonemic diversity is then due entirely to B, so if we know the splitting time $t$, we can estimate the rate of phoneme accumulation in a large territory. Two models of phoneme accumulation, linear and exponential: $r = \frac{P_B - P_C}{t}$ and $k = \frac{\ln(P_B) - \ln(P_C)}{t}$.
  51. 51. Idea Estimate $r$ or $k$. Estimate modern phonemic diversity in Africa ($P_{\text{Africa}}$) and guess the initial phonemic diversity $P_{\text{initial}}$ (take $P_{\text{initial}} = 11$ or $29$, the minimum and median modern phonemic diversity). Then we can deduce the age of the initial language.
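The two clock models can be sketched numerically as follows. This is a minimal illustration, not the authors' code: the function names and every input value are hypothetical placeholders (the slides do not give $P_B$, $P_C$ or $P_{\text{Africa}}$), and the age formulas are an assumption obtained by inverting the linear and exponential models, i.e. $T = (P_{\text{Africa}} - P_{\text{initial}})/r$ and $T = (\ln P_{\text{Africa}} - \ln P_{\text{initial}})/k$.

```python
import math


def linear_rate(p_b, p_c, t):
    """Linear model: phonemes gained per year, r = (P_B - P_C) / t."""
    return (p_b - p_c) / t


def exponential_rate(p_b, p_c, t):
    """Exponential model: k = (ln P_B - ln P_C) / t."""
    return (math.log(p_b) - math.log(p_c)) / t


def age_linear(p_africa, p_initial, r):
    """Years needed to grow from p_initial to p_africa at linear rate r."""
    return (p_africa - p_initial) / r


def age_exponential(p_africa, p_initial, k):
    """Years needed to grow from p_initial to p_africa at exponential rate k."""
    return (math.log(p_africa) - math.log(p_initial)) / k


# Hypothetical inputs, chosen only to make the arithmetic visible.
P_B, P_C, P_AFRICA = 40.0, 25.0, 50.0

for t in (45_000, 65_000):          # the two colonization dates, in years
    for p_initial in (11.0, 29.0):  # the two guesses from the slide
        r = linear_rate(P_B, P_C, t)
        k = exponential_rate(P_B, P_C, t)
        print(t, p_initial,
              round(age_linear(P_AFRICA, p_initial, r)),
              round(age_exponential(P_AFRICA, p_initial, k)))
```

The loop reproduces the structure of the results slide: two models, two colonization dates and two guesses of the initial diversity give eight age estimates.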
  52. 52. Parameter estimation Single natural experiment: Andaman islands (colonized 45-65 kya). Estimate $P_B$ and $P_C$ by averaging over 20 languages.
  53. 53. Results (Linear/exponential model; 2 estimates for colonization of Andaman islands; 2 guesses of $P_{\text{initial}}$).
  54. 54. Comments Highly dependent on strong assumptions. Needs model validation. The parameter estimates are very noisy, since they are based on related languages. Intriguing.
  55. 55. Atkinson: Phonemic Diversity Supports a Serial Founder Effect Model of Language Expansion from Africa. Quentin D. Atkinson.
Abstract: Human genetic and phenotypic diversity declines with distance from Africa, as predicted by a serial founder effect in which successive population bottlenecks during range expansion progressively reduce diversity, underpinning support for an African origin of modern humans. Recent work suggests that a similar founder effect may operate on human culture and language. Here I show that the number of phonemes used in a global sample of 504 languages is also clinal and fits a serial founder-effect model of expansion from an inferred origin in Africa. This result, which is not explained by more recent demographic history, local language diversity, or statistical non-independence within language families, points to parallel mechanisms shaping genetic and linguistic diversity and supports an African origin of modern human languages.
Excerpts from the first page (the three-column layout is truncated in the slide): "The number of phonemes (perceptually distinct units of sound that differentiate words) in a language is positively correlated with the size of its speaker population (1) in such a way that small populations have fewer phonemes. Languages continually gain and lose phonemes because of stochastic processes (2, 3). If phoneme distinctions are more likely to be lost in small founder populations, then a succession [...]"
"[...]ing the evolution of phonemes (11, 16) and language generally (17–20). This raises the possibility that the serial founder-effect model used to trace our genetic origins to a recent expansion from Africa (4–9) could also be applied to global phonemic diversity to investigate the origin and expansion of modern human languages. Here I examine geographic variation in phoneme inventory size using data on vowel, consonant, and tone inventories [...]"
"[...] 0.131, df = 503, P = 0.003). To account for any non-independence within language families, the analysis was repeated, first using mean values at the language family level (table S2) and then using a hierarchical linear regression framework to model nested dependencies in variation at the family, subfamily, and genus levels (15). These analyses confirm that, consistent with a founder effect model, smaller population size predicts reduced phoneme inventory size both between families (family-level analysis r = 0.468, df = 49, P < 0.001; fig. S1B) and within families, controlling for taxonomic affiliation {hierarchical linear model: fixed-effect coefficient (b) = 0.0338 to 0.0985 [95% highest posterior density (HPD)], P = 0.009}. Figure 1B shows clear regional differences in phonemic diversity, with the largest phoneme inventories in Africa and the smallest in South America and Oceania. A series of linear regressions was used to predict phoneme inventory size from the log of speaker population size and distance from 2560 potential origin locations around the world (15). Incorporating modern speaker population size into the model controls for geographic patterning in population size and means that the analysis is conservative about the amount of variation attributed to ancient demography. Model fit was evaluated with the Bayesian information criterion (BIC) (22). [...]"
  56. 56. Question at hand Can we use phonemic data to reconstruct the expansion of language?
  58. 58. Premises Premise 1: Size of population is positively correlated with number of phonemes. The effect is small, but significant. Still holds when you restrict to population < 5000.
  60. 60. Premises Premise 1: Size of population is positively correlated with number of phonemes. Premise 2: Language expansion occurred with a series of small founder groups. Conclusion: During language expansion, the number of phonemes decreases at the frontier of expansion.
  61. 61. Idea Test many different possible points of origin. From the true point of origin, you should observe a cline in phoneme diversity.
  62. 62. Data Data from WALS, 504 languages (all languages with data). Three traits, grouped into classes by WALS: Tone structure (No tone / Simple tone / Complex tone); Number of vowels (2–4 / 5–6 / 7–14); Number of consonants (6–14 / 15–18 / 19–25 / 26–33 / 34+).
  63. 63. Map of languages
  64. 64. Tone map
  65. 65. Vowel map
  66. 66. Consonant map
  67. 67. Measure of phonemic diversity A measure of phonemic diversity is created. Scale and center the three components: tone, vowels, consonants (mean = 0, variance = 1). Diversity is measured as the unweighted sum of the three components. This is probably the most contentious aspect of the paper. The issues this raises are dealt with in the Supplementary Material.
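The construction of the diversity score can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's procedure: the WALS classes are assumed to be coded as ordinal integers (a hypothetical coding), the population standard deviation is assumed for the scaling, and the function names and toy values are invented for the example.

```python
import statistics


def standardize(values):
    """Center and scale a list of numbers to mean 0 and (population) variance 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def phonemic_diversity(tones, vowels, consonants):
    """Unweighted sum of the three standardized components, one score per language."""
    z_scores = [standardize(component) for component in (tones, vowels, consonants)]
    return [sum(parts) for parts in zip(*z_scores)]


# Hypothetical ordinal codes for four languages (class index within each WALS trait).
tones = [0, 1, 2, 0]
vowels = [1, 2, 3, 2]
consonants = [1, 3, 5, 2]

print(phonemic_diversity(tones, vowels, consonants))
```

Because each component is standardized before summing, a trait with five classes (consonants) does not automatically outweigh a trait with three (tone), which is one reason the unweighted sum is a debatable choice.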
  68. 68. Main result For each possible origin, perform a linear regression of phoneme diversity against distance from origin. Best regression:
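The origin scan can be sketched as a loop over candidate origins. This is a simplified illustration, not the paper's analysis: it assumes straight great-circle distances, ordinary least squares with distance as the only predictor, and selection by $R^2$ among negative slopes, whereas the paper also includes speaker population size and compares models with BIC. All coordinates and diversity values are synthetic.

```python
import math

EARTH_RADIUS_KM = 6371.0


def great_circle_km(a, b):
    """Haversine distance between two (latitude, longitude) points in degrees."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def fit_line(x, y):
    """Ordinary least squares of y on x; returns (slope, R^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


def best_origin(candidates, languages):
    """Return (origin, slope, R^2) for the candidate with the clearest decline."""
    best = None
    for origin in candidates:
        distances = [great_circle_km(origin, loc) for loc, _ in languages]
        diversity = [d for _, d in languages]
        slope, r2 = fit_line(distances, diversity)
        if slope < 0 and (best is None or r2 > best[2]):
            best = (origin, slope, r2)
    return best


# Synthetic data: diversity declines exactly linearly with distance from true_origin.
true_origin = (0.0, 20.0)
locations = [(10.0, 30.0), (30.0, 80.0), (-20.0, 140.0), (60.0, 100.0),
             (-10.0, -50.0), (40.0, -100.0), (5.0, 0.0)]
languages = [(loc, 2.0 - great_circle_km(true_origin, loc) / 5000.0) for loc in locations]
candidates = [true_origin, (50.0, 10.0), (-30.0, -60.0)]

print(best_origin(candidates, languages))
```

On this synthetic data the scan recovers the origin used to generate the cline; with real data the interesting question is how sharply the fit distinguishes neighbouring candidates.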
  69. 69. Map of best origin
  71. 71. Validation Several other factors could have an influence. The author checks: Dependence within language families; Modern population size; Geographic area; Population density; Local language diversity (number of languages within N km, N = 500, 1000, 2000, 3000); A second origin. Other factors certainly play a role. However, as long as they are uncorrelated with distance to Africa, they will not bias the results.
  72. 72. Tones vs Vowels vs Consonants The measure of diversity is fishy. The author also checks for Tones, Vowels and Consonants separately. The conclusions are unchanged.
  73. 73. Comments Statistically sound. The data are what they are... Assumptions: see multiple comments in Linguistic Typology (2011)
  74. 74. Bayesian model choice The evidence for or against a model given data is summarized in the Bayes factor: $B_{21}(y) = \frac{P[M = 2 \mid y] / P[M = 1 \mid y]}{P[M = 2] / P[M = 1]} = \frac{m_2(y)}{m_1(y)}$, where $m_k(y) = \int_{\Theta_k} L(\theta_k \mid y)\, \pi_k(\theta_k)\, d\theta_k$. This includes an automatic penalty for complexity: when a model is more complex, the integral is over a larger space, where the likelihood is very small most of the time.
  75. 75. Interpreting the Bayes factor Jeffreys' scale of evidence states that if $\log_{10}(B_{21})$ is between 0 and 0.5, the evidence in favor of model 2 is weak; between 0.5 and 1, it is substantial; between 1 and 2, it is strong; above 2, it is decisive (and symmetrically for negative values).
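A toy example makes the Bayes factor and the scale concrete. The setting below is an assumption chosen because both marginal likelihoods have closed forms; it is not taken from any of the papers. Model 1 fixes a coin's bias at $\theta = 0.5$; model 2 puts a uniform prior on $\theta$, for which the integral $m_2(y)$ equals $1/(n+1)$ for binomial data. The function names are hypothetical.

```python
import math


def log10_bayes_factor_coin(heads, n):
    """log10 of B21 for M1: theta = 0.5 versus M2: theta ~ Uniform(0, 1)."""
    m1 = math.comb(n, heads) * 0.5 ** n  # marginal likelihood under the point model
    m2 = 1.0 / (n + 1)                   # closed form of the integral under the uniform prior
    return math.log10(m2 / m1)


def jeffreys_label(log10_b21):
    """Translate log10(B21) into the verbal categories listed on the slide."""
    if log10_b21 == 0:
        return "no evidence either way"
    favoured = 2 if log10_b21 > 0 else 1
    size = abs(log10_b21)
    if size < 0.5:
        strength = "weak"
    elif size < 1:
        strength = "substantial"
    elif size < 2:
        strength = "strong"
    else:
        strength = "decisive"
    return f"{strength} evidence for model {favoured}"


print(jeffreys_label(log10_bayes_factor_coin(5, 10)))   # balanced data
print(jeffreys_label(log10_bayes_factor_coin(10, 10)))  # ten heads in a row
```

With 5 heads out of 10 the simpler model is mildly preferred even though the flexible model can fit the data equally well: this is the automatic complexity penalty at work.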
  76. 76. Outline 1 Swadesh: glottochronology 2 Gray and Atkinson: Language phylogenies 3 Lieberman et al., Pagel et al.: Frequency of use 4 Perreault and Mathew, Atkinson: Phoneme expansion 5 Dunn et al.: Word-order universals 6 Bouchard-Côté et al.: automated cognates 7 Overall conclusions
  77. 77. Dunn et al.: Evolved structure of language shows lineage-specific trends in word-order universals. Michael Dunn, Simon J. Greenhill, Stephen C. Levinson and Russell D. Gray. Letter, doi:10.1038/nature09923
Abstract: Languages vary widely but not without limit. The central goal of linguistics is to describe the diversity of human languages and explain the constraints on that diversity. Generative linguists following Chomsky have claimed that linguistic diversity must be constrained by innate parameters that are set as a child learns a language [1,2]. In contrast, other linguists following Greenberg have claimed that there are statistical tendencies for co-occurrence of traits reflecting universal systems biases [3–5], rather than absolute constraints or parametric variation. Here we use computational phylogenetic methods to address the nature of constraints on linguistic diversity in an evolutionary framework [6]. First, contrary to the generative account of parameter setting, we show that the evolution of only a few word-order features of languages are strongly correlated. Second, contrary to the Greenbergian generalizations, we show that most observed functional dependencies between traits are lineage-specific rather than universal tendencies. These findings support the view that, at least with respect to word order, cultural evolution is the primary factor that determines linguistic structure, with the current state of a linguistic system shaping and constraining future states.
Excerpts from the first page (the two-column layout is truncated in the slide): "Human language is unique amongst animal communication systems not only for its structural complexity but also for its diversity at [...]"
"[...] after the noun, whereas dominant object–verb ordering predicts postpositions, relative clauses and genitives before the noun [4]. One general explanation for these observations is that languages tend to be consistent ('harmonic') in their order of the most important element or 'head' of a phrase relative to its 'complement' or 'modifier' [3], and so if the verb is first before its object, the adposition (here preposition) precedes the noun, while if the verb is last after its object, the adposition follows the noun (a 'postposition'). Other functionally motivated explanations emphasize consistent direction of branching within the syntactic structure of a sentence [9] or information structure and processing efficiency [5]. To demonstrate that these correlations reflect underlying cognitive or systems biases, the languages must be sampled in a way that controls for features linked only by direct inheritance from a common ancestor [10]. However, efforts to obtain a statistically independent sample of languages confront several practical problems. First, our knowledge of language relationships is incomplete: specialists disagree about high-level groupings of languages and many languages are only tentatively assigned to language families. Second, a few large language families contain the bulk of global linguistic variation, making sampling purely from unrelated languages impractical. Some balance of related, unrelated and areally distributed languages has usually been aimed for in practice [11,12]."
  78. 78. Question at hand Do the following traits evolve independently? 1 Adjective-Noun order 2 Adposition-Noun phrase order 3 Demonstrative-Noun order 4 Genitive-Noun order 5 Numeral-Noun order 6 Object-Verb order 7 Relative clause-Noun order 8 Subject-Verb order
  79. 79. Idea Difficult to get an independent sample of languages. Instead, look at languages known to be related and check for co-evolution.
  80. 80. Data 4 language families: Indo-European, Bantu, Uto-Aztecan, Austronesian. Word-order data mainly from WALS.
  81. 81. Lexical tree
  82. 82. Methods Start with a sample of trees constructed with lexical data. Put two traits at the leaves. Perform Bayesian model choice between independent model and co-evolution model.
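The slides do not spell out the two models. A common formulation for two binary traits, assumed here purely for illustration, is a continuous-time Markov chain on the four joint states: the co-evolution (dependent) model has eight free rates, one per single-trait change, while the independent model constrains them so that each trait's rates ignore the state of the other trait (four free rates). The function and variable names below are hypothetical.

```python
# Joint states of (trait A, trait B), e.g. (Object-Verb order, Adposition order).
STATES = [(0, 0), (0, 1), (1, 0), (1, 1)]


def rate_matrix(rates):
    """Build the 4x4 generator from a dict {(from_state, to_state): rate}.

    Only single-trait changes are allowed; simultaneous changes have rate 0.
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, source in enumerate(STATES):
        for j, target in enumerate(STATES):
            changed = sum(a != b for a, b in zip(source, target))
            if changed == 1:
                q[i][j] = rates[(source, target)]
        q[i][i] = -sum(q[i])  # each row of a generator sums to zero
    return q


def independent_rates(a_gain, a_loss, b_gain, b_loss):
    """Eight rates constrained so that neither trait depends on the other."""
    per_trait = [(a_gain, a_loss), (b_gain, b_loss)]
    rates = {}
    for source in STATES:
        for target in STATES:
            diff = [k for k in range(2) if source[k] != target[k]]
            if len(diff) == 1:
                gain, loss = per_trait[diff[0]]
                rates[(source, target)] = gain if target[diff[0]] == 1 else loss
    return rates


rates = independent_rates(0.1, 0.2, 0.3, 0.4)
for row in rate_matrix(rates):
    print(row)
```

Under this formulation the independent model is nested in the dependent one, so the Bayes factor between them measures whether the extra four rates are supported by the data on a given tree.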
  83. 83. Two models
  85. 85. Results
  86. 86. Conclusions The networks look very different. Co-evolution seems to be very different between different language families.
  87. 87. What went wrong? 1 Nothing about borrowing or geography 2 No attempt at cross-validation 3 Over-interpreting the data 4 Testing too many hypotheses 5 They did not test the hypothesis that different families have the same dependencies!
  88. 88. Outline 1 Swadesh: glottochronology 2 Gray and Atkinson: Language phylogenies 3 Lieberman et al., Pagel et al.: Frequency of use 4 Perreault and Mathew, Atkinson: Phoneme expansion 5 Dunn et al.: Word-order universals 6 Bouchard-Côté et al.: automated cognates 7 Overall conclusions
  89. 89. Automatic cognate detection? All models of lexical data take cognacy judgements (and family membership) for granted. It would be nice to do without these: we could increase the number of languages under study, as well as the number of meanings.
  90. 90. Levenshtein distance The Levenshtein distance (or edit distance) between two words is the number of changes (substitutions, insertions, deletions) necessary to change one word into the other. Swedish tre (/tre:/) and Latin tres (/tre:s/): distance 1 (1 insertion). Swedish tre and Dutch drie (/dri/): distance 2 (2 substitutions). (Can be normalized by word length.) By calculating the distance between all words in a meaning list, get a distance between two languages. Deduce a tree (several possible algorithms: Neighbour joining, UPGMA...). Unlike other methods, this does not require cognacy judgements, and it takes into account the phonetics (more information). Serva and Petroni 2008, van der Ark et al. 2007: "the resulting tree looks close to the truth" (except it doesn't).
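The distance itself is a standard dynamic program. The sketch below works on sequences of phoneme symbols rather than raw strings, so that a long vowel such as /e:/ counts as one unit; that tokenization, and the choice to normalize by the longer word, are assumptions made for this example.

```python
def levenshtein(a, b):
    """Edit distance between two sequences of phoneme symbols."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (x != y)))  # substitution or match
        previous = current
    return previous[-1]


def normalized_levenshtein(a, b):
    """Distance divided by the length of the longer sequence (one possible normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


swedish_tre = ["t", "r", "e:"]
latin_tres = ["t", "r", "e:", "s"]
dutch_drie = ["d", "r", "i"]

print(levenshtein(swedish_tre, latin_tres))             # 1
print(levenshtein(swedish_tre, dutch_drie))             # 2
print(normalized_levenshtein(swedish_tre, latin_tres))  # 0.25
```

Note that every substitution costs the same here, whatever the phonemes involved: the code treats t → d and t → a as equally likely changes.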
  91. 91. Levenshtein distance: so many problems! Does not require that compared words be cognates. Does not require that compared languages be cousins. No model of phonetic change (hence no model validation or check for mis-specification). Some changes are more likely than others. Changes can be systematic (e.g. vowel shift), not independent. See Greenhill (2010) for a detailed rebuttal.
  92. 92. Bouchard-Côté et al. (2012): Automated reconstruction of ancient languages using probabilistic models of sound change. Alexandre Bouchard-Côté (Department of Statistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada), David Hall and Dan Klein (Computer Science Division, University of California, Berkeley, CA 94720) and Thomas L. Griffiths (Department of Psychology, University of California, Berkeley, CA 94720). Edited by Nick Chater, University of Warwick, Coventry, United Kingdom, and accepted by the Editorial Board December 22, 2012 (received for review March 19, 2012). Keywords: ancestral | computational | diachronic
Abstract: One of the oldest problems in linguistics is reconstructing the words that appeared in the protolanguages from which modern languages evolved. Identifying the forms of these ancient languages makes it possible to evaluate proposals about the nature of language change and to draw inferences about human history. Protolanguages are typically reconstructed using a painstaking manual process known as the comparative method. We present a family of probabilistic models of sound change as well as algorithms for performing inference in these models. The resulting system automatically and accurately reconstructs protolanguages from modern languages. We apply this system to 637 Austronesian languages, providing an accurate, large-scale automatic reconstruction of a set of protolanguages. Over 85% of the system's reconstructions are within one character of the manual reconstruction provided by a linguist specializing in Austronesian languages. Being able to automatically reconstruct large numbers of languages provides a useful way to quantitatively explore hypotheses about the factors determining which sounds in a language are likely to change over time. We demonstrate this by showing that the reconstructed Austronesian protolanguages provide compelling support for a hypothesis about the relationship between the function of a sound and its probability of changing that was first proposed in 1955.
Excerpt from the first page (the two-column layout is truncated in the slide): "[...] building on work on the reconstruction of ancestral sequences and alignment in computational biology (8–12). Several groups have recently explored how methods from computational biology can be applied to problems in historical linguistics, but such work has focused on identifying the relationships between languages (as might be expressed in a phylogeny) rather than reconstructing the languages themselves (13–18). Much of this type of work has been based on binary cognate or structural matrices (19, 20), which discard all information about the form that words take, simply indicating whether they are cognate. Such models did not have the goal of reconstructing protolanguages and consequently use a representation that lacks the resolution required to infer ancestral phonetic sequences. Using phonological representations allows us to perform reconstruction and does not require us to assume that cognate sets have been fully resolved as a preprocessing step. Representing the words at each point in a phylogeny and having a model of how they change give a way of comparing different hypothesized cognate sets and hence inferring cognate sets automatically. The focus on problems other than reconstruction in previous computational approaches has meant that almost all existing protolanguage reconstructions have been done manually. However, to obtain more accurate reconstructions for older languages, large numbers of modern languages need to be analyzed. The Proto-Austronesian language, for instance, has over 1,200 descendant languages (21). All of these languages could potentially [...]"
  93. 93. Aim Model of sound change along a tree (including systematic changes). Very large number of parameters. Attempt to perform automatically the comparative method. Possible uses: reconstruction of ancient forms, cognate detection, phylogenies...
  95. 95. Model Assume that the phylogeny is known. Example tree: Latin (/tre:s/, /ste:lla/) is the parent of Italian (/tre/, /stella/) and Spanish (/tRes/, /estReLa/). Latin to Italian, /tre:s/ → /tre/: P[t → t] = 0.98, P[r → r] = 0.99, P[e: → e] = 0.95, P[s → −] = 0.20. (These numbers are made up.)
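With the made-up numbers from the slide, the probability of the whole word change is the product of the per-phoneme probabilities. The sketch below assumes a fixed alignment and independent, context-free changes, which is a simplification of the model in the paper; the data structures are hypothetical.

```python
import math

# Made-up transition probabilities on the Latin -> Italian branch (from the slide).
# None stands for deletion of the phoneme.
transition_prob = {
    ("t", "t"): 0.98,
    ("r", "r"): 0.99,
    ("e:", "e"): 0.95,
    ("s", None): 0.20,
}


def word_probability(alignment, probs):
    """Probability of a word-level change given aligned (parent, child) phonemes."""
    return math.prod(probs[pair] for pair in alignment)


# /tre:s/ -> /tre/, phoneme by phoneme.
alignment = [("t", "t"), ("r", "r"), ("e:", "e"), ("s", None)]
p = word_probability(alignment, transition_prob)
print(round(p, 6))  # about 0.184
```

Even though each individual change is likely or at least plausible, the probability of the whole word is dominated by its least likely step, here the deletion of final /s/.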
  96. 96. Transition probabilities Branch Latin to Italian, transition probabilities for /t/: t 0.98; d 0.01; k 0.01; anything else 0. We would like to estimate the transition probabilities from any phoneme to any other phoneme, on each branch. Huge number of parameters (number of phonemes × number of phonemes × number of branches).
  97. 97. Context dependent In fact, the probability that /s/ is deleted depends on the position within the word. The transition probabilities become context dependent: P[s → − | preceded by e and is last letter] = 0.70; P[r → r | between t and e] = 0.99. Even more parameters!
  98. 98. Reducing the number of parameters Generic contexts of the form "CrV". Dirichlet prior: force most parameters to be zero (sparsity).
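The sparsity effect of a Dirichlet prior can be seen by sampling from it. This sketch only illustrates the general property that a concentration parameter below 1 favours vectors with most mass on a few entries; the concentration values, the vector size and the helper name are assumptions, not settings from the paper.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw one vector from a symmetric Dirichlet(alpha) via normalized gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)
sparse = sample_dirichlet(0.05, 20, rng)  # small concentration: a few entries dominate
flat = sample_dirichlet(50.0, 20, rng)    # large concentration: entries close to 1/20

print("largest entry, alpha = 0.05:", max(sparse))
print("largest entry, alpha = 50:  ", max(flat))
```

Read as a prior over the row of transition probabilities for one phoneme in one context, the sparse case encodes the belief that only a handful of outcomes are possible.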
  99. 99. Implementation Possible to use large datasets: dumps from Wiktionary, Bible... Expectation-Maximization algorithm. Only point estimates of the parameters, not uncertainty.
  100. 100. Results: reconstruction
  101. 101. Results: Most probable transitions
  102. 102. Bouchard-Côté et al.: Comments Impressive first step. Much more refinement is possible, especially in defining the context. Allows quantitative tests of open questions (e.g. functional load, universality of changes). Joint estimation of sound changes and phylogenies.
  103. 103. Outline 1 Swadesh: glottochronology 2 Gray and Atkinson: Language phylogenies 3 Lieberman et al., Pagel et al.: Frequency of use 4 Perreault and Mathew, Atkinson: Phoneme expansion 5 Dunn et al.: Word-order universals 6 Bouchard-Côté et al.: automated cognates 7 Overall conclusions
  104. 104. Overall conclusions When done right, statistical methods can provide new insight into linguistic history. The conclusions will only ever be as good as the data. Importance of collaboration in building the model and in checking for mis-specification. Major avenues for future research. Challenges in finding relevant data, building models, and statistical inference: models for morphosyntactic traits; putting together lexical, phonemic and morphosyntactic traits; escaping the tree-like model (borrowing, networks, diffusion processes...); incorporating geography.
