Closing the Gap: Data Models for Documentary Linguistics

Talk at Latrobe University (May 2005, Melbourne)

Published in: Technology, Education

  1. Closing the Gap: Data Models for Documentary Linguistics
       Baden Hughes
       Department of Computer Science and Software Engineering
       The University of Melbourne
       badenh@cs.mu.oz.au
  2. Overview
       • Overall Context
       • The Electronic Data Format Challenge
       • Common Problems
       • Data Encoding Models
           • Lexicons, interlinear texts, paradigms, syntactic trees, annotation standards, query languages
       • Linguistic Motivations vs Computational Interests
       • New Types of Data Exploration
       • Effects on Linguistic Analysis
       • New Tools
       • Conclusions
       Latrobe Uni - Linguistics Seminar - 20050505
  3. Overall Context
       • Large amounts of human language data continue to be managed in electronic form and analysed in fieldwork-driven linguistic documentation
       • Increasing focus on acquisition-centric methodologies, which have vastly increased the rate of growth of linguistic data
       • Reasonably static basic linguistic data structures, largely grounded in the print domain
  4. The Electronic Data Format Challenge
       • The methods used for the digital encoding of linguistic data are often disparate
       • Often at best reduced to native formats supported by widely used tools such as Shoebox
       • Conversion is typically complex and lossy
           • Sometimes this can't be predicted in advance
       • Many utility manipulation functions are required to move data between analytical applications and outputs
       • These functions are largely external to analytical environments, with some notable exceptions (e.g. regular expression manipulation)
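
As a rough illustration of why conversion out of a tool-native format can be lossy, here is a minimal Python sketch that reads backslash-coded records of the kind Shoebox stores. It assumes the common `\lx` (lexeme), `\ps` (part of speech) and `\ge` (English gloss) markers and that every record starts with `\lx`. The sample entries and the function names `parse_records` and `to_flat_dict` are invented for this example and are not taken from the talk.

```python
# Toy data: invented entries, not from any real language or corpus.
SAMPLE = r"""
\lx aba
\ps n
\ge fish
\ge seafood

\lx miko
\ps n
\ge child
"""


def parse_records(text):
    """Split backslash-coded text into records.

    Each record is an ordered list of (marker, value) pairs, so repeated
    fields and field order are preserved. Assumes a record starts at \\lx.
    """
    records, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if not line.startswith("\\"):
            # Continuation line: append it to the previous field's value.
            if current:
                marker, value = current[-1]
                current[-1] = (marker, (value + " " + line).strip())
            continue
        marker, _, value = line[1:].partition(" ")
        if marker == "lx" and current:
            records.append(current)
            current = []
        current.append((marker, value.strip()))
    if current:
        records.append(current)
    return records


def to_flat_dict(record):
    """Naive conversion keeping one value per marker: repeated fields are lost."""
    return dict(record)


records = parse_records(SAMPLE)
flat = to_flat_dict(records[0])
```

In this sketch the first record keeps both `\ge` fields as an ordered list, while the flattened dictionary silently retains only the last one. That is one small instance of the kind of loss that is hard to predict before a conversion is attempted.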
  5. Common Problems
       Despite diversity of language and analytical approach, many documentary and descriptive linguists face a common challenge: the interoperability and longevity of electronic data generated in fieldwork settings.
       • Repurposing data
       • Publishing data on the web
       • Publishing in papers
       • New analysis tools
       • New generation formats
  6. The Emergence of Abstract Language Data Encoding Models
       Recently, a number of formal data encoding models for linguistic data types have emerged from projects investigating "best practice" methods for preserving linguistic data. We will briefly consider models for:
       • lexicons
       • interlinear texts
       • paradigms
       • syntactic trees
       • annotation standards
       • query languages
  7. Data Models (1)
       • Lexicons
           • Bell & Bird (2001)
       • Interlinear Text
           • Bow, Hughes & Bird (2003)
           • Hughes, Bird & Bow (2003)
       • Linguistic Paradigms
           • Penton, Bow, Bird & Hughes (2004)
           • Penton & Bird (2004)
  8. Data Models (2)
       • Syntactic Trees
           • Lai & Bird (2004)
       • Annotation Standards
           • Farrar, Lewis & Langendoen (2002)
           • Farrar & Langendoen (2003)
       • Query Languages
           • Bird, Chen, Davidson, Lee & Zheng (2005)
           • Cassidy & Bird (2000)
           • Taylor (2004)
  9. Linguistic Motivations
       • Data models: so what?
       • It is the combined utility of these models that makes them attractive to documentary linguists
       • The challenge is to lower the barrier to use of these technologies in fieldwork and analytical contexts
       • Linguists (mostly) don't care about the technology, they just want to do linguistics!
       • Computer scientists are generally not interested in linguistics …
  10. Computational Interests
       • The development of such models may be inherently interesting to computationally inclined researchers
       • Human language data encoding and annotation is genuinely interesting in computer science terms; unfortunately, basic data modelling isn't
       • Technologists have a bad habit of providing advice which is well intended but lacks traction in non-technical communities (e.g. "use XML")
       • Many of the solutions are XML-based, but contain many more components than just XML-encoded data
  11. New Types of Data Exploration (1)
       Open implemented solutions for a range of manipulations are available
       • Lexicons
           • Generation of different types of lexicons
       • Interlinear Text (see following examples …)
           • Generation of different types of interlinear text
           • Induction of morphosyntactic glossing from lexicons
           • Generation of lexicons from interlinear text
           • Enrichment of lexicons from interlinear text
  12. Nenets Interlinear (1)
  13. Nenets Interlinear (2)
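
Two of the interlinear-text manipulations listed on slide 11, generating a lexicon from interlinear text and inducing a gloss line from a lexicon, can be sketched in a few lines of Python. The sketch assumes a simple phrase > word > morpheme hierarchy. The class names, field names and word forms are hypothetical and invented for illustration; they are not the schema from the cited papers and not the Nenets data shown in the talk.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Morpheme:
    form: str
    gloss: str


@dataclass
class Word:
    form: str
    morphemes: list = field(default_factory=list)


@dataclass
class Phrase:
    words: list
    translation: str = ""


def build_lexicon(phrases):
    """Collect every morpheme form with the set of glosses it was given."""
    lexicon = defaultdict(set)
    for phrase in phrases:
        for word in phrase.words:
            for morpheme in word.morphemes:
                lexicon[morpheme.form].add(morpheme.gloss)
    return {form: sorted(glosses) for form, glosses in sorted(lexicon.items())}


def gloss_word(segments, lexicon):
    """Induce a gloss for a segmented word; unknown morphemes become '???'.

    Takes the first listed gloss for each form, which is a simplification.
    """
    return "-".join(lexicon.get(seg, ["???"])[0] for seg in segments)


# Invented toy phrase, not real language data.
phrases = [
    Phrase(
        words=[
            Word("taki", [Morpheme("tak", "dog"), Morpheme("-i", "PL")]),
            Word("suni", [Morpheme("sun", "run"), Morpheme("-i", "PL")]),
        ],
        translation="the dogs run",
    )
]

lexicon = build_lexicon(phrases)
known = gloss_word(["sun", "-i"], lexicon)
unknown = gloss_word(["mok", "-i"], lexicon)
```

Because both directions work over the same structure, a lexicon derived from glossed text can be used to pre-gloss new text, and any morpheme it cannot gloss is flagged as a candidate for enriching the lexicon.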
  14. New Types of Data Exploration (2)
       Open implemented solutions for a range of manipulations are available
       • Syntactic Trees
           • Induction of trees from interlinear text
           • Creation of interlinear text from syntactic tree drawing
           • Creation of lexicons from syntactic trees
       • Paradigms (see following examples …)
           • Generation of different types of paradigms
           • Induction of paradigms from interlinear text
           • Annotation of interlinear text from paradigms
           • Enrichment of lexicons from paradigms
  15. Kanarese Paradigm (1)
  16. Kanarese Paradigm (2)
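
The paradigm manipulations listed on slide 14 can be approximated by treating a paradigm as a set of word forms annotated with feature values, then pivoting on any two features. The Python sketch below makes that assumption. The function names `induce_paradigm` and `render`, the feature names and the word forms are all invented for illustration; this is not the model from the cited papers or the Kanarese data shown in the talk.

```python
def induce_paradigm(forms, row_feature, col_feature):
    """Arrange annotated forms into a table keyed by two chosen features."""
    table = {}
    for form, features in forms:
        row = features[row_feature]
        col = features[col_feature]
        table.setdefault(row, {})[col] = form
    return table


def render(table, columns):
    """Render the table as tab-separated text; '-' marks an unattested cell."""
    lines = ["\t".join([""] + columns)]
    for row in sorted(table):
        cells = [table[row].get(col, "-") for col in columns]
        lines.append("\t".join([row] + cells))
    return "\n".join(lines)


# Invented toy forms, as they might be harvested from glossed words.
forms = [
    ("suna", {"person": "1", "number": "sg"}),
    ("suni", {"person": "1", "number": "pl"}),
    ("sunta", {"person": "2", "number": "sg"}),
]

by_person = induce_paradigm(forms, "person", "number")
by_number = induce_paradigm(forms, "number", "person")
text = render(by_person, ["sg", "pl"])
```

Swapping the row and column features gives a different presentation of the same underlying data, and the unattested cell (second person plural here) shows up as a gap to be checked in further elicitation.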
  17. Effects on Linguistic Analysis
       • Integrated encoding standards for linguistic data affect the practice of linguistic analysis
       • Some analysis types are now easier
       • New possibilities emerge
       • New analytical challenges are discovered
       • Data linkage/integration is certainly one of the improvements
  18. New Tools
       • The next generation of tools which support these data models natively is emerging, e.g. FIELD, ELAN, Toolbox (almost)
       • "Middleware" which allows the translation of legacy formats to and from these models is reasonably widely available
       • Analytical tools are increasingly being implemented with web-grounded technologies and using web-derived models
       • Open source/open data approaches are becoming pervasive
  19. Conclusion
       • Reducing the gap between computationally tractable representations, on which a high degree of functionality can be built, and the simple underlying formats driven by fieldwork-oriented tools is advantageous to all parties
           • It reduces the intermediate data-munging steps, which require technical knowledge rather than linguistic knowledge
       • While we are not quite "there yet", the light at the end of the tunnel is definitely there
       • Growing community of philosophically aligned computer scientists and linguists
  20. References
       • Bell & Bird, 2001. A Preliminary Study of the Structure of Lexicon Entries. Proceedings of the Workshop on Web-Based Language Documentation and Description.
       • Bow, Hughes & Bird, 2003. Towards a General Model for Interlinear Text. Proceedings of EMELD 2003.
       • Farrar, Lewis & Langendoen, 2002. A Common Ontology for Linguistic Concepts. Proceedings of the Knowledge Technologies Conference.
       • Farrar & Langendoen, 2003. A Linguistic Ontology for the Semantic Web. GLOT International 7(3).
       • Hughes, Bird & Bow, 2003. Encoding and Presenting Interlinear Text Using XML Technologies. Proceedings of ALTW 2003.
       • Lai & Bird, 2004. Querying and Updating Treebanks: A Critical Survey and Requirements Analysis. Proceedings of ALTW 2004.
       • Penton, Bow, Bird & Hughes, 2004. Towards a General Model for Linguistic Paradigms. Proceedings of EMELD 2004.
       • Penton & Bird, 2004. Representing and Rendering Linguistic Paradigms. Proceedings of ALTW 2004.
       • Bird, Chen, Davidson, Lee & Zheng, 2005. Extending XPath to Support Linguistic Queries. Proceedings of PLANX 2005.
       • Cassidy & Bird, 2000. Querying Databases of Annotated Speech. Proceedings of the Eleventh Australasian Database Conference.
       • Taylor, 2004. XSLT as a Linguistic Query Language. BSc(Hons) Thesis, University of Melbourne.
  21. Questions? Comments?
