The CLUES database: automated search for linguistic cognates

Overview of the design of the CLUES database, developed as an aid to the comparative method in historical linguistics. Includes information on the design of the database and the strategies used to detect correlate forms (potential cognates), including metrics used to rate similarity of form and meaning.

Published in: Technology, Education


  • 1. The CLUES database: automated search for cognate forms
    Australian Linguistics Society Conference, Canberra, 4 December 2011
    Mark Planigale (Mark Planigale Research & Consultancy)
    Tonya Stebbins (RCLT, La Trobe University)
  • 2. Introduction
    - Overview of the design of the CLUES database, being developed as a tool to aid the search for correlates across multiple datasets
    - The linguistic model underlying the database
    - Key issues in developing the methodology
    - Examples of output from the database
    Because the design of CLUES is relatively generic, it is potentially applicable to a wide range of languages, and to tasks other than correlate detection.
  • 3. Context
  • 4. What is CLUES?
    "Correlate Linking and User-defined Evaluation System": a database designed to handle lexical data from multiple languages simultaneously, with add-on modules for comparative functions.
    Primary purpose: identify correlates across two or more languages.
    - Correlate: a pair of lexemes which are similar in phonetic form and/or meaning.
    - The linguist assesses which of the identified correlates are cognates, and which are similar for some other reason (borrowing, universal tendencies, accidental similarity).
    CLUES allows the user to adjust the criteria used to evaluate the degree of correlation between lexemes, and can store, filter and organise the results of comparisons.
  • 5. Computational methods in historical linguistics
    - Lexicostatistics
    - Typological comparison
    - Phylogenetics
    - Phoneme inventory comparison
    - Modelling the effects of sound change rules
    - Correlate search > CLUES
  • 6. A few examples
    - Lowe & Mazaudon 1994: 'Reconstruction Engine' (models the operation of proposed sound change rules as a means of checking hypotheses)
    - Nakhleh et al. 2005: Indo-European; phylogenetic
    - Holman et al. 2008: Automated Similarity Judgment Program; 4350 languages; 40 lexical items (edit distance); 85 most stable grammatical (typological) features from the WALS database
    - Austronesian Basic Vocabulary Database: 874 mostly Austronesian languages, each represented by around 210 words (the project had a phylogenetic focus, and did some manual comparative work in preparing the data)
    - Greenhill & Gray 2009: Austronesian; phylogenetic
    - Dunn, Burenhult et al. 2011: Aslian
    - Proto-TaioMatic ("merges, searches, and extends several wordlists and proposed reconstructions of proto-Tai and Southwestern Tai")
  • 7. Broad vs. deep approaches to automated lexical comparison

    Parameter          'Broad and shallow'                      'Narrow and deep'
    Language sample    Relatively large                         Relatively small
    Vocabulary sample  Constrained, based on a standardised     All available lexical data for
                       wordlist (e.g. Swadesh 200, 100 or 40)   selected languages
    Purpose            Establish (hypothesised) genetic         Linguistic and/or cultural
                       relationships                            reconstruction; model language
                                                                contact and semantic shift
    Method             Lexicostatistics; phylogenetics          Comparative method with fuzzy
                                                                matching
    Typical metrics    Phonetic (e.g. edit distance);           Phonetic (e.g. edit distance);
                       typological (shared grammatical          semantic; grammatical
                       features); maximum likelihood

    CLUES comparisons can be constrained to core vocabulary (using the wordlist feature); however, it is intended to be used within a 'narrow and deep' approach.
  • 8. Design of CLUES
  • 9. CLUES: Desiderata
    - Accuracy: results agree with human expert judgment; minimisation of false positives and negatives
    - Validity: the computed similarity level does measure degree of correlation; the computed similarity level varies directly with cognacy
    - Reliability: like results for like comparison pairs; like results for a single comparison pair on repetition
    - Generalisability: the system performs accurately on new ('unseen') data as well as on the data that the similarity metrics were 'trained' on
    - Efficiency: comparisons are performed fast enough to be useful
  • 10. Lexical model (partial)
    [Entity-relationship diagram: entities include Language, Lexeme (with orthography, part of speech and temporal information), Source, Sense, Wordlist item, Written form, Gloss, Semantic domain and Phone, linked by one-to-many and many-to-many relationships, e.g. one Language to many Lexemes.]
  • 11. Three dimensions of lexical similarity

    Dimension of comparison                          Data fields currently available
    Phonetic/phonological (phonetic form of lexeme)  Written form (mapped to phonetic content)
    Semantic (meaning of lexeme)                     Semantic domain; gloss
    Grammatical (grammatical features of lexeme)     Word class

    In the context of correlate detection, grammatical features may be of interest as a 'dis-similarising' feature for lexemes that are highly correlated on form and meaning.
  • 12. What affects the results? (listed from more to less controllable)
    Selection and evaluation of metrics
    - Choice of appropriate formal (quantifiable) criteria for similarity
    - Impact: validity of results; generalisability of system
    Inconsistent representations
    - Systematic differences in the representations used for different data sets within the corpus
    - Impact: validity of results
    Noise
    - Random fluctuations within the data that obscure the true value of individual data items, but do not change the underlying nature of the distribution
    - Impact: reliability of data, reliability of results
  • 13. CLUES: Managing representational issues
    - Automated generation of phonetic form(s) from written form(s)
    - Where required, manual standardisation to common lexicographic conventions
    - Manual assignment to a common ontology (semantic domain set)
    - Automated mapping onto a shared common set of grammatical features, values and terms
  • 14. Calculating similarity
  • 15. Similarity scores
    [Diagram: base similarity scores for written form, gloss, semantic domain and wordclass are combined, via weights w1-w4, into form, meaning and grammar subtotals; the subtotals are in turn combined, via weights w5-w7, into the overall total score.]
  • 16. Example 4a: Ura ɣunǝga vs. Mali kunēngga 'sun'
    - Written form(s): ɣunǝga [ɣunǝga] vs. kunēngga [ɣunǝŋga]; base similarity 0.896, weight 1.0; form subtotal 0.896 (weight 0.45)
    - Gloss(es): sun vs. sun; base similarity 1.0, weight 0.5
    - Semantic domain(s): A3 vs. A3; base similarity 1.0, weight 0.5; meaning subtotal 1.0 (weight 0.45)
    - Wordclass: N vs. N; base similarity 1.0, weight 1.0; grammar subtotal 1.0 (weight 0.1)
    Overall score: 0.953
  • 17. Sulka kolkha 'sun' vs. Mali dulka 'stone'
    Example 4b:
    - Written form(s): kolkha [kolkha] vs. dulka [dulka]; base similarity 0.828, weight 1.0; form subtotal 0.828 (weight 0.45)
    - Gloss(es): sun vs. stone; base similarity 0.0, weight 0.5
    - Semantic domain(s): A3 vs. A5; base similarity 0.333, weight 0.5; meaning subtotal 0.167 (weight 0.45)
    - Wordclass: N vs. N; base similarity 1.0, weight 1.0; grammar subtotal 1.0 (weight 0.1)
    Overall score: 0.548
    Example 4c (same lexemes, re-weighted):
    - Written form(s): base similarity 0.828, weight 1.0; form subtotal 0.828 (weight 0.7)
    - Gloss(es): base similarity 0.0, weight 0.5
    - Semantic domain(s): base similarity 0.333, weight 0.0; meaning subtotal 0.0 (weight 0.2)
    - Wordclass: base similarity 1.0, weight 1.0; grammar subtotal 1.0 (weight 0.1)
    Overall score: 0.68
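The weighted aggregation in examples 4a and 4b can be reproduced with a small sketch (function and parameter names are mine, not CLUES code; subtotals are weighted sums of base similarities, and the overall score is a weighted sum of subtotals):

```python
def weighted_sum(pairs):
    """Weighted sum of (score, weight) pairs."""
    return sum(score * weight for score, weight in pairs)

def overall_score(form, gloss, domain, wordclass,
                  w_gloss=0.5, w_domain=0.5,
                  w_form=0.45, w_meaning=0.45, w_grammar=0.1):
    """Combine base similarities into an overall score (weights as in 4a/4b)."""
    meaning = weighted_sum([(gloss, w_gloss), (domain, w_domain)])
    return weighted_sum([(form, w_form), (meaning, w_meaning),
                         (wordclass, w_grammar)])

# Example 4a: Ura ɣunǝga vs. Mali kunēngga 'sun'
print(round(overall_score(0.896, 1.0, 1.0, 1.0), 3))    # 0.953
# Example 4b: Sulka kolkha 'sun' vs. Mali dulka 'stone'
print(round(overall_score(0.828, 0.0, 0.333, 1.0), 3))  # 0.548
```

Changing the keyword weights reproduces re-weighted runs such as example 4c.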
  • 18. Sample results: across domains (5a)
    Small set of lexical data from 7 languages; 'symmetrical'; overall scores.
    Items: (tau) kabarak 'blood' N J1; (ura) ɣunǝga 'sun' N A3; (sul) kre 'stone' N A5; (sul) ka ptaik 'skin' N J1; (mal) kunēngga 'sun' N A3; (ura) slǝp 'bone' N J1; (qaq) ltigi 'fire' N T1; (mal) lēt 'light a fire' V T1; (mal) slēpki 'bone' N J1.

              kabarak ɣunǝga   kre      ka ptaik kunēngga slǝp    ltigi   lēt     slēpki
    kabarak   1       0.309    0.2905   0.657    0.2995   0.5435  0.278   0.2435  0.541
    ɣunǝga    0.309   1        0.34725  0.2665   0.948    0.312   0.3515  0.2445  0.325
    kre       0.2905  0.34725  1        0.2615   0.33875  0.3395  0.2825  0.294   0.2745
    ka ptaik  0.657   0.2665   0.2615   1        0.2895   0.5275  0.2835  0.226   0.587
    kunēngga  0.2995  0.948    0.33875  0.2895   1        0.289   0.3025  0.22    0.3495
    slǝp      0.5435  0.312    0.3395   0.5275   0.289    1       0.326   0.3815  0.8905
    ltigi     0.278   0.3515   0.2825   0.2835   0.3025   0.326   1       0.6945  0.371
    lēt       0.2435  0.2445   0.294    0.226    0.22     0.3815  0.6945  1       0.307
    slēpki    0.541   0.325    0.2745   0.587    0.3495   0.8905  0.371   0.307   1
  • 19. Sample results: within a domain (5b); overall similarity scores
    Items: (qaq) dul, (ura) dul, (mal) dulka, (tau) aaletpala, (sul) kre, (kua) vat, (sia) fat, all 'stone' N A5; (kua) dududul 'fighting stone' N M1.

                dul     dul     dulka   aaletpala kre     vat     fat     dududul
                (qaq)   (ura)   (mal)   (tau)     (sul)   (kua)   (sia)   (kua)
    dul (qaq)   1       1       0.875   0.6945    0.7355  0.759   0.739   0.425
    dul (ura)   1       1       0.875   0.6945    0.7355  0.759   0.739   0.425
    dulka       0.875   0.875   1       0.776     0.79    0.7205  0.7355  0.426
    aaletpala   0.6945  0.6945  0.776   1         0.7375  0.727   0.73    0.3815
    kre         0.7355  0.7355  0.79    0.7375    1       0.7785  0.798   0.3075
    vat         0.759   0.759   0.7205  0.727     0.7785  1       0.9805  0.3095
    fat         0.739   0.739   0.7355  0.73      0.798   0.9805  1       0.298
    dududul     0.425   0.425   0.426   0.3815    0.3075  0.3095  0.298   1
  • 20. Sample results: within a domain (5c); form similarity only
    Same items as 5b.

                dul     dul     dulka   aaletpala kre     vat     fat     dududul
                (qaq)   (ura)   (mal)   (tau)     (sul)   (kua)   (sia)   (kua)
    dul (qaq)   1       1       0.75    0.389     0.471   0.518   0.478   0.6
    dul (ura)   1       1       0.75    0.389     0.471   0.518   0.478   0.6
    dulka       0.75    0.75    1       0.552     0.58    0.441   0.471   0.602
    aaletpala   0.389   0.389   0.552   1         0.475   0.454   0.46    0.513
    kre         0.471   0.471   0.58    0.475     1       0.557   0.596   0.365
    vat         0.518   0.518   0.441   0.454     0.557   1       0.961   0.369
    fat         0.478   0.478   0.471   0.46      0.596   0.961   1       0.346
    dududul     0.6     0.6     0.602   0.513     0.365   0.369   0.346   1
  • 21. Metrics
    - A wide variety of metrics can be implemented and 'plugged into' the comparison strategy
    - Metrics return a real value in the range [0.0, 1.0] representing the level of similarity of the items being compared
    - The user can control which set of metrics is used
    - Multiple comparison strategies can be applied to the same data set, and their results stored and compared
    - The metrics discussed here are those used to produce the sample results
    - General principle: "best match" (prefer false positives to false negatives)
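The plug-in idea might be sketched as below (purely illustrative: the record fields, metric names and weights are invented, and the real CLUES comparison strategy is richer than a flat weighted list):

```python
from typing import Callable, Dict, List, Tuple

# A metric compares two lexeme records and returns a similarity in [0.0, 1.0].
Metric = Callable[[Dict[str, str], Dict[str, str]], float]
Strategy = List[Tuple[Metric, float]]   # (metric, weight) pairs

def compare(lex1, lex2, strategy: Strategy) -> float:
    """Weighted combination of pluggable metrics."""
    return sum(weight * metric(lex1, lex2) for metric, weight in strategy)

# Two toy metrics standing in for the real form/meaning comparisons:
def same_wordclass(a, b):
    return 1.0 if a["pos"] == b["pos"] else 0.0

def same_domain(a, b):
    return 1.0 if a["domain"] == b["domain"] else 0.0

strategy = [(same_wordclass, 0.4), (same_domain, 0.6)]
lex1 = {"pos": "N", "domain": "A3"}
lex2 = {"pos": "N", "domain": "A5"}
print(compare(lex1, lex2, strategy))   # 0.4
```

Swapping the strategy list lets the same data set be compared under different criteria, with each run's results stored separately.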
  • 22. Phonetic form similarity metric
    "Edit distance with phone substitution probability matrix"
    f1, f2 := the phonetic forms being compared (lists of phones, generated automatically from written forms or transcribed manually)
    Apply the edit distance algorithm to f1 and f2 with the following costs:
    - Deletion cost = 1.0 (constant)
    - Insertion cost = 1.0 (constant)
    - Substitution cost = 2 x (1 - sp), where sp is phone similarity; substitution cost falls in the range [0.0, 2.0]
    dmin := minimum edit distance for f1 and f2
    dmax := maximum possible edit distance for f1 and f2 (sum of the lengths of f1 and f2)
    Similarity = 1 - (dmin / dmax)
    This finds the maximal unbounded alignment of two forms; it can also be understood as detecting the contribution of each form to a putative combined form.
    Examples:
    - mbias vs. biaska: dmin = 3, dmax = 11; similarity = 1 - (3/11) = 0.727; combined form mbiaska
    - vat vs. fat: dmin = 0.236, dmax = 6; similarity = 1 - (0.236/6) = 0.96; combined form {v,f}at
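A minimal dynamic-programming implementation of this metric might look as follows (a sketch, not the CLUES code; `sp` defaults to exact-match phone similarity, and the vat/fat value sp(v, f) = 0.882 is implied by the slide's dmin of 0.236 rather than stated):

```python
def form_similarity(f1, f2, sp=None):
    """Edit-distance similarity in [0, 1]: 1 - dmin/dmax, dmax = len(f1) + len(f2)."""
    if sp is None:
        sp = lambda a, b: 1.0 if a == b else 0.0    # exact-match fallback
    n, m = len(f1), len(f2)
    if n + m == 0:
        return 1.0
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = float(i)                          # deletions, cost 1.0 each
    for j in range(m + 1):
        d[0][j] = float(j)                          # insertions, cost 1.0 each
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 2.0 * (1.0 - sp(f1[i - 1], f2[j - 1]))   # cost in [0.0, 2.0]
            d[i][j] = min(d[i - 1][j] + 1.0,        # delete
                          d[i][j - 1] + 1.0,        # insert
                          d[i - 1][j - 1] + sub)    # substitute
    return 1.0 - d[n][m] / (n + m)

print(round(form_similarity(list("mbias"), list("biaska")), 3))  # 0.727

# With a phone-similarity entry sp(v, f) = 0.882, giving dmin = 0.236:
vf = lambda a, b: 1.0 if a == b else (0.882 if {a, b} == {"v", "f"} else 0.0)
print(round(form_similarity(list("vat"), list("fat"), vf), 2))   # 0.96
```

Because substitution can never cost more than a deletion plus an insertion, the algorithm naturally prefers aligning similar phones, which is what produces the "maximal unbounded alignment" reading.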
  • 23. Phone similarity metric
    - Phone similarity sp for a pair of phones is a real number in the range [0, 1], drawn from a phone similarity matrix
    - The matrix is calculated automatically as a weighted sum of the similarities between the phonetic features of the two phones
    - Examples of phonetic features include nasality (universal), frontness (vowels) and place of articulation (consonants)
    - Each phonetic feature has a set of possible values and a similarity matrix for those values; the similarity matrix is user-editable
    - The feature similarity matrix should reflect the probability of various paths of diachronic change
    - It is possible to under-specify feature values for phones
    - The similarity of a phone with itself will always be 1.0
    - 'Default' similarities can be overridden for particular phones (universal) and/or phonemes (language pair-specific)
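The weighted-sum construction can be sketched like this (everything below is invented for illustration: the feature inventory, value matrices and weights are not the CLUES data, only the mechanism is):

```python
# Per-feature similarity values for non-identical feature values (symmetric
# lookup); identical values always score 1.0. All numbers here are invented.
FEATURE_SIMS = {
    "voicing": {("voiced", "voiceless"): 0.5},
    "place":   {("labial", "labiodental"): 0.8, ("labial", "velar"): 0.1},
    "manner":  {("stop", "fricative"): 0.4},
}
FEATURE_WEIGHTS = {"voicing": 0.2, "place": 0.4, "manner": 0.4}  # sums to 1.0

def feature_sim(feature, v1, v2):
    if v1 == v2:
        return 1.0
    table = FEATURE_SIMS[feature]
    return table.get((v1, v2), table.get((v2, v1), 0.0))

def phone_similarity(p1, p2):
    """Weighted sum of per-feature similarities; identical phones score 1.0."""
    return sum(w * feature_sim(feat, p1[feat], p2[feat])
               for feat, w in FEATURE_WEIGHTS.items())

b = {"voicing": "voiced", "place": "labial", "manner": "stop"}
f = {"voicing": "voiceless", "place": "labiodental", "manner": "fricative"}
print(round(phone_similarity(b, b), 3))  # 1.0
print(round(phone_similarity(b, f), 2))  # 0.58
```

Per-phone or per-language-pair overrides would simply shadow the computed matrix entry before lookup.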
  • 24. Semantic domain similarity metric
    "Depth of deepest subsumer as a proportion of maximum local depth of the semantic domain tree"
    n1, n2 := the semantic domains being compared (nodes in the semantic domain tree)
    S := 'subsumer': the deepest node in the semantic domain tree that subsumes both n1 and n2
    ds := depth of S in the tree (path length from the root node to S)
    dm := maximum local depth of the tree (length of the longest path from the root node to an ancestor of n1 or n2)
    Similarity = ds / dm
    [Diagram: example tree rooted at A, with B, C, ... at the first level, D and E at the next level, and F below them.]
    Examples: F vs. F = 1.0; D vs. E = 0.333; B vs. C = 0.0
    See also Li et al. (2003).
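The metric can be sketched on a toy tree. This is hypothetical: the tree shape below is one reading of the slide's diagram, and dm is interpreted as the depth of the deepest node at or below the subsumer, a reading that reproduces the slide's three example values; the exact CLUES definition may differ.

```python
# Hypothetical domain tree: A is the root; B and C are children of A;
# D and E are children of B; F is a child of D.
PARENT = {"B": "A", "C": "A", "D": "B", "E": "B", "F": "D"}

def ancestors(node):
    """The node itself, then its ancestors up to the root."""
    chain = [node]
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain

def depth(node):
    return len(ancestors(node)) - 1

def domain_similarity(n1, n2):
    """ds / dm: depth of deepest shared subsumer over maximum local depth."""
    a2 = set(ancestors(n2))
    subsumer = next(a for a in ancestors(n1) if a in a2)   # deepest common node
    ds = depth(subsumer)
    # Maximum local depth: deepest node at or below the subsumer.
    dm = max(depth(n) for n in PARENT if subsumer in ancestors(n))
    return ds / dm if dm else 1.0

print(domain_similarity("F", "F"))            # 1.0
print(round(domain_similarity("D", "E"), 3))  # 0.333
print(domain_similarity("B", "C"))            # 0.0
```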
  • 25. Gloss similarity metric
    A crude sentence comparison metric: "proportion of tokens in common"
    g1, g2 := the glosses being compared
    r1, r2 := the reduced glosses (after removal of stop words, e.g. a, the, of)
    len1, len2 := the lengths of r1, r2 (in tokens)
    L := max(len1, len2)
    If L = 0, Similarity = 1.0; otherwise:
    C := count of common tokens (tokens that appear in both r1 and r2)
    Similarity = C / L
    Examples: 'house' vs. 'house' = 1.0; 'house' vs. 'a house' = 1.0; 'house' vs. 'raised sleeping house' = 0.333; 'house' vs. 'hut' = 0.0
    This metric needs refinement.
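This metric translates directly into code (the stop-word list is illustrative; the slide names only a, the and of):

```python
STOP_WORDS = {"a", "an", "the", "of"}   # illustrative stop-word list

def gloss_similarity(g1, g2):
    """Proportion of tokens in common after stop-word removal."""
    r1 = [t for t in g1.lower().split() if t not in STOP_WORDS]
    r2 = [t for t in g2.lower().split() if t not in STOP_WORDS]
    longest = max(len(r1), len(r2))
    if longest == 0:
        return 1.0          # both glosses reduce to nothing
    common = len(set(r1) & set(r2))
    return common / longest

print(gloss_similarity("house", "a house"))                          # 1.0
print(round(gloss_similarity("house", "raised sleeping house"), 3))  # 0.333
print(gloss_similarity("house", "hut"))                              # 0.0
```

As the 'house' vs. 'hut' case shows, the metric sees only token overlap, not meaning, which is why the slide flags it as needing refinement.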
  • 26. Conclusion
  • 27. Possible extensions; unresolved questions
    Extensions: find borrowings; detect duplicate lexicographic entries; orthographic conversion; ...
    Analytical questions:
    - How should tone be represented and incorporated within phonetic comparison?
    - Phonetic feature system: multi-valued or binary?
    - Segmentation: comparison at the phone, phone sequence or phoneme level?
    - The edit distance metric may be improved by privileging uninterrupted identical sequences.
    - More elaborate semantic matching: more sophisticated approaches using taxonomies (e.g. WordNet, with some way to map lexemes onto concepts) or compositional semantics (primitives).
    Performance:
    - Since comparison is parameterised, it may be possible to use genetic algorithms to optimise performance.
    - A quantitative way to evaluate the performance of the system is needed.
    Relation to theory: How much theory is embedded in the instrument? What effect does this have on results?
    Inter-operability between databases is a key issue in the ultimate usability of the tool.
  • 28. Acknowledgements
    Thanks to Christina Eira, Claire Bowern, Beth Evans, Sander Adelaar, Friedel Frowein, Sheena Van Der Mark and Nicolas Tournadre for their comments and suggestions on this project.
  • 29. References
    Atkinson et al. 2005.
    Bakker, Dik, André Müller, Viveka Velupillai, Søren Wichmann, Cecil H. Brown, Pamela Brown, Dmitry Egorov, Robert Mailhammer, Anthony Grant and Eric W. Holman. 2009. Adding typology to lexicostatistics: a combined approach to language classification. Linguistic Typology 13: 167-179.
    Holman, Eric W., Søren Wichmann, Cecil H. Brown, Viveka Velupillai, André Müller and Dik Bakker. 2008. Explorations in automated language classification. Folia Linguistica 42.2: 331-354.
    Li, Yuhua, Zuhair Bandar and David McLean. 2003. An approach for measuring semantic similarity between words using multiple information sources. IEEE Transactions on Knowledge and Data Engineering 15.4: 871-882.
    Lowe, John Brandon and Martine Mazaudon. 1994. The Reconstruction Engine: a computer implementation of the comparative method. Computational Linguistics 20.3 (Special Issue on Computational Phonology): 381-417.
    Nakhleh, Luay, Don Ringe and Tandy Warnow. 2005. Perfect phylogenetic networks: a new methodology for reconstructing the evolutionary history of natural languages. Language 81: 382-420. (Cited from Bakker et al. 2009.)