Language Documentation in the 21st Century


Lecture given at University of Hong Kong Linguistics Department, 13 September 2013

Published in: Technology, Education

  1. 1. 1 Language Documentation in the 21st Century • Prof Peter K. Austin, Endangered Languages Academic Programme, Department of Linguistics, SOAS • Department of Linguistics, University of Hong Kong, 13th September 2013
  2. 2. 2 © 2013 Peter K. Austin Creative commons licence: Attribution-NonCommercial-NoDerivs CC BY-NC-ND
  3. 3. 3 Outline • Language documentation in 1995 and today • Establishing principles for the field • Developments since 2005 • Some current challenges • Conclusions
  4. 4. 4 Language documentation • “concerned with the methods, tools, and theoretical underpinnings for compiling a representative and lasting multipurpose record of a natural language or one of its varieties” (Himmelmann 1998) • has developed over the last 20 years in large part in response to the urgent need to make an enduring record of the world’s many endangered languages and to support speakers of these languages in their desire to maintain them, fuelled also by developments in information and communication technologies • essentially concerned with roles of language speakers and their rights and needs
  5. 5. 5 Publications: books and journals • Gippert et al 2006 Essentials of Language Documentation. Mouton • Tsunoda 2006 Language endangerment and language revitalization: an introduction • Language Documentation and Description – 11 issues (2,000+ copies sold), 1 in prep • Language Documentation and Conservation – 6 issues (on-line only) • Cambridge Handbook of Endangered Languages 2011 • Routledge Essential Readings 2011 • Oxford Bibliography Online 2012
  6. 6. 6 DoBeS projects
  7. 7. 7 ELAR deposits
  8. 8. 8 Main features (Himmelmann 2006:15) • Primary data – collection and analysis of an array of primary language data to be made available for a wide range of users; • Accountability – access to primary data and representations of it makes evaluation of linguistic analyses possible and expected; • Long-term storage and preservation of primary data – includes a focus on archiving in order to ensure that documentary materials are made available to potential users now and into the distant future;
  9. 9. 9 Main features (cont.) • Interdisciplinary teams – documentation requires input and expertise from a range of disciplines and is not restricted to mainstream (“core”) linguistics alone • Cooperation with and direct involvement of the speech community – active and collaborative work with community members both as producers of language materials and as co-researchers • Outcome is annotated and translated corpus of archived representative materials on a language
  10. 10. 10 Stuart McGill Cicipu corpus
  11. 11. 11 Cicipu Toolbox
  12. 12. 12 Critique: Dobrin, Austin & Nathan 2007 • “subtle and pervasive kinds of commoditisation (reduction of languages to common exchange values) abound, particularly in competitive and programmatic contexts such as grant-seeking and standard-setting where languages are necessarily compared and ranked” • archivism: quantifiable properties such as recording hours, data volume, and file parameters, and technical desiderata like ‘archival quality’ and ‘portability’ have become reference points in assessing the aims and outcomes of language documentation – these are not measures of quality • [Figure labels: documentary dog, archiving tail]
  13. 13. 13 Skills issues • video madness: video recordings are made without reference to hypotheses, goals, or methodology, simply because the technology is available, portable and relatively inexpensive • audio skills are lacking: documentary linguists show little or no knowledge about recording arts and microphone types, properties and placement (microphone choice and handling is the single greatest determiner of recording quality) • corpus taming: documentary linguists show little ability at corpus and metadata management, ranging from file naming to bundle organisation
  14. 14. 14 Myopia (Austin 2012) • ILG blindness: many documenters believe that interlinear glossing is the “gold standard” of annotation but it is very time-consuming and illegible to non-linguists – overview annotation may be preferred as a primary goal: “roadmap” or index of a recording – approximately time-aligned information about what is in the recording, who is participating, and other interesting phenomena • Toolbox and ELAN as “Nietzsche’s typewriter”
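
The "roadmap" idea of overview annotation can be illustrated with a small data structure. The following is only a minimal sketch in Python: the class `OverviewEntry`, its field names, and the sample entries are hypothetical and do not come from the lecture or from any annotation tool. Each entry records an approximate time span, a plain-language summary, the participants, and optional notes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OverviewEntry:
    """One roughly time-aligned note about a stretch of a recording (hypothetical schema)."""
    start_s: float                      # approximate start, in seconds
    end_s: float                        # approximate end, in seconds
    summary: str                        # what is going on in this stretch
    participants: List[str] = field(default_factory=list)
    notes: str = ""                     # other phenomena worth flagging


def fmt(seconds: float) -> str:
    """Format a number of seconds as mm:ss for a human-readable roadmap."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def find_entries(index: List[OverviewEntry], participant: str) -> List[OverviewEntry]:
    """Return the entries in which the given participant takes part."""
    return [entry for entry in index if participant in entry.participants]


# Invented sample roadmap for a ten-minute recording.
roadmap = [
    OverviewEntry(0, 95, "Greetings and introductions", ["Speaker A", "Speaker B"]),
    OverviewEntry(95, 410, "Narrative about a fishing trip", ["Speaker A"], "includes a short song"),
    OverviewEntry(410, 600, "Discussion of the narrative", ["Speaker A", "Speaker B"]),
]

for entry in roadmap:
    who = ", ".join(entry.participants)
    print(f"{fmt(entry.start_s)}-{fmt(entry.end_s)}  {entry.summary}  [{who}]")
```

An index like this is quick to produce and readable by non-linguists, which is the contrast with interlinear glossing drawn on the slide.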
  15. 15. 15 • with no guiding framework for assessing quality, progress, and value in their work, documentary linguists fall back on established patterns, referring to quantifiable indices of language vitality or technical standards for the density of acoustic information even when these are not rationalised by the particular language or research situation • diversity (goals, contexts, people) – move away from “Noah’s Ark” projects to more specialised documentation, eg. ELDP 2012 grant list • we need more and better attention to goals, methods, skills, outcomes and values of language documentation
  16. 16. 16 A 21st century model Woodbury 2011 enlarges concept of language documentation: “creation, annotation, preservation and dissemination of transparent records of a language.” and identifies several gaps in a Himmelmann-type approach: “While simple in concept, it is complex and multifaceted in practice because: • its object, language, encompasses conscious and unconscious knowledge, ideation and cognitive ability, as well as overt social behaviour; • records of these things must draw on concepts and techniques from linguistics, ethnography, psychology, computer science, recording arts and more;
  17. 17. 17 A 21st century model • the creation, annotation, preservation and dissemination of such records pose new challenges in all these fields, as well as information and archival sciences; and • “above all, humans experience their own and other people’s languages viscerally and have differing stakes, purposes, goals and aspirations for language records and language documentation” Woodbury emphasises: • Diversity of goals, purposes and outcomes • Need for a theory of the documentary corpus • Need for accounts of individual project designs
  18. 18. 18 Need for meta-documentation (Austin 2013) • meta-documentation concerns the theory and practices of meta-data, data about the data being collected and analysed • metadata: • is needed for identification, management, retrieval of the data • provides the context and understanding of that data • carries those understandings into the future, and to others (and hence is important for archiving and preservation) • reflects knowledge and practices of data providers
  19. 19. 19 Metadata • defines and constrains audiences and usages for the data • all value-adding to recordings of events involves the creation of metadata – all annotations (transcriptions, translations, glosses, pos tagging, etc.) are metadata (Nathan and Austin 2004)
  20. 20. 20 Metadata gaps • recommendations for creating metadata for language documentation have been primarily influenced by library concepts (eg. Dublin Core), and key metadata notions have been interoperability, standardisation, discovery, and access (OLAC, EMELD, Farrar & Langendoen 2003). • the goals of language documentation mean this is not powerful enough and we need a theory of metadata, largely lacking until now • Nathan (2010): “meta-documentation is the documentation of your data itself, and the conditions (linguistic, social, physical, technical, historical, biographical) under which it was produced. Such meta-documentation should be as rich and appropriate as the documentary materials themselves”
  21. 21. 21 Missing meta-documentation categories • identity of stakeholders involved and their roles in the project • attitudes and ideologies of language consultants, both towards their languages and towards the documenter and documentation project • relationships with consultants and community • goals and methodology of researcher, including research methods and tools (see Lüpke 2010), corpus theorisation (Woodbury 2011), theoretical assumptions embedded in annotation (abbreviations, glosses), potential for revitalisation
  22. 22. 22 • biography of the project, including background knowledge and experience of the researcher and main consultants (eg. how much fieldwork the researcher had done at the beginning of the project and under what conditions, what training the researcher and consultants had received) • for funded projects, includes original grant application and any amendments, reports to the funder, email communications with the funder and/or any discussions with an archive (eg. reviews of sample data)
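
The missing categories listed on the two slides above could be gathered into a single record kept alongside a deposit. The sketch below is an assumption-laden illustration in Python, not a schema proposed in the lecture: the class and field names are hypothetical, each category is reduced to free text or a list, and the sample values are invented.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Stakeholder:
    name: str
    role: str  # e.g. "researcher", "consultant", "funder" (illustrative labels)


@dataclass
class MetaDocumentation:
    """Hypothetical record covering the meta-documentation categories on the slides."""
    stakeholders: List[Stakeholder] = field(default_factory=list)
    consultant_attitudes: str = ""      # attitudes and ideologies of consultants
    community_relationships: str = ""   # relationships with consultants and community
    goals_and_methodology: str = ""     # goals, methods, tools, theoretical assumptions
    project_biography: str = ""         # background and experience of those involved
    funding_documents: List[str] = field(default_factory=list)  # applications, reports

    def missing_categories(self) -> List[str]:
        """Names of the categories that have not been filled in yet."""
        return [name for name, value in asdict(self).items() if not value]


# Invented example: only two categories have been documented so far.
record = MetaDocumentation(
    stakeholders=[Stakeholder("Researcher X", "researcher")],
    goals_and_methodology="Narrative texts recorded on video; overview annotation first.",
)
print(record.missing_categories())
```

A check such as `missing_categories` is one simple way to make gaps in meta-documentation visible while a project is still running.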
  23. 23. 23 Archiving in the 21st century • Two major approaches have emerged • ‘big data’ archiving • archiving inspired by social media models
  24. 24. 24 Big data archiving • e.g. MPI-Nijmegen • CLARIN, DARIAH, VLO • “integrated digital research environments that allow researchers to combine resources and tools from various sources in a seamless way” (Trilsbeek & Koenig 2013) • Component MetaData Infrastructure (CMDI) • mandatory to link each field to a concept definition in a central data category registry called ISOcat • goal of data mining and cross-corpus extraction, use of large scale computational linguistics tools
  25. 25. 25 Archive 2.0: social media models • traditionally archiving focussed heavily on preservation • however documentation often deals with highly sensitive topics (sacred stories, gossip) • needs powerful but flexible access management • transparency – ease of understanding • use positively – social networking model • access through relationships • relationships and sharing produce new opportunities • ELAR URCS system
  26. 26. 26 ELAR URCS system • e.g. Trevor Johnston Auslan deposit • Logged in user displays
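
The idea of "access through relationships" can be sketched as a small model in which a user applies to a depositor and, once approved, holds a role that opens some resources but not others. This Python sketch is not ELAR's implementation: the class `Deposit`, its methods, the role labels, and the file names are all hypothetical and serve only to illustrate the workflow.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Deposit:
    """Hypothetical deposit with role-based, depositor-controlled access."""
    deposit_id: str
    # resource name -> roles allowed to open it (labels are illustrative only)
    resource_roles: Dict[str, Set[str]] = field(default_factory=dict)
    # username -> roles the depositor has granted for this deposit
    grants: Dict[str, Set[str]] = field(default_factory=dict)
    # username -> message attached to a pending access application
    pending: Dict[str, str] = field(default_factory=dict)

    def apply_for_access(self, username: str, message: str) -> None:
        """Record an application; the depositor decides later."""
        self.pending[username] = message

    def approve(self, username: str, role: str) -> None:
        """Depositor approves an application and grants a role."""
        self.pending.pop(username, None)
        self.grants.setdefault(username, set()).add(role)

    def can_access(self, username: str, resource: str) -> bool:
        """A resource opens if the user holds at least one allowed role."""
        allowed = self.resource_roles.get(resource, set())
        return bool(allowed & self.grants.get(username, set()))


# Invented example deposit with two resources at different access levels.
deposit = Deposit(
    "example-deposit",
    resource_roles={
        "story1.eaf": {"user", "researcher"},
        "ceremony.mp4": {"community"},
    },
)
deposit.apply_for_access("xx", "I would like to see your annotation files.")
before = deposit.can_access("xx", "story1.eaf")
deposit.approve("xx", "user")
after = deposit.can_access("xx", "story1.eaf")
print(before, after, deposit.can_access("xx", "ceremony.mp4"))
```

The message attached to the application and the depositor's decision are what turn access control into a relationship, as the example emails later in the talk show.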
  27. 27. 27 OAIS model • OAIS archives define three types of ‘packages’: ingestion, archive, dissemination • [Figure: Producers → Ingestion → Archive → Dissemination → Designated communities]
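
The three package types can be sketched as a simple pipeline. In the OAIS reference model they are called the Submission, Archival, and Dissemination Information Packages (SIP, AIP, DIP). The Python below is a simplified illustration under stated assumptions: the class names, the fields, and the choice of SHA-256 checksums for fixity are illustrative and are not prescribed by OAIS or taken from any archive's software.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SubmissionPackage:       # SIP: what a producer hands to the archive
    producer: str
    files: Dict[str, bytes]
    metadata: Dict[str, str]


@dataclass
class ArchivalPackage:         # AIP: what the archive preserves
    producer: str
    files: Dict[str, bytes]
    metadata: Dict[str, str]
    checksums: Dict[str, str]  # fixity information recorded at ingest


@dataclass
class DisseminationPackage:    # DIP: what a member of a designated community receives
    files: Dict[str, bytes]
    metadata: Dict[str, str]


def ingest(sip: SubmissionPackage) -> ArchivalPackage:
    """Turn a submission into an archival package, recording a checksum per file."""
    checksums = {name: hashlib.sha256(data).hexdigest() for name, data in sip.files.items()}
    return ArchivalPackage(sip.producer, dict(sip.files), dict(sip.metadata), checksums)


def disseminate(aip: ArchivalPackage, wanted: List[str]) -> DisseminationPackage:
    """Verify fixity of the requested files, then package them for delivery."""
    for name in wanted:
        if hashlib.sha256(aip.files[name]).hexdigest() != aip.checksums[name]:
            raise ValueError(f"fixity check failed for {name}")
    return DisseminationPackage({name: aip.files[name] for name in wanted}, dict(aip.metadata))


# Invented example submission with placeholder file contents.
sip = SubmissionPackage(
    producer="Depositor X",
    files={"session1.wav": b"audio-bytes", "session1.eaf": b"annotation-bytes"},
    metadata={"language": "Example language"},
)
aip = ingest(sip)
dip = disseminate(aip, ["session1.eaf"])
print(sorted(dip.files))
```

In this one-directional flow, depositors appear only at the ingest step, which is the point of contrast with the archive 2.0 model on the next slides.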
  28. 28. 28 ELAR archive 2.0 model
  29. 29. 29 Rethinking the archive model • progressive archiving – a challenge to whole approach of documentary linguistics • establish user account at beginning of project – users add and manage/update resources over time • user accounts show access and usage/downloads analytics – cf.
  30. 30. 30 [Figure: “classical” archiving: collect, process, publish, then archive (and hope that death does not intervene) • progressive archiving: collect resources/data, archive them]
  31. 31. 31 Rethinking archive participation • users e.g. add bookmarks, negotiate access • depositors e.g. updating and editing content • negotiate access • monitoring usage • collaborations • exchange & share information • establish groups • community curation
  32. 32. 32 User xx has just applied for access to restricted material in the deposit johnston2012auslan. The following message was attached to the application: "Hello [depositor], xx here. I'm interested in having a look at some of your video deposit, including annotation files. I am working on a project documenting Central Australian Indigenous sign with yy (see If ok, I'd like to see how you do the annotation - we have worked out a template and annotation protocol, but this needs a lot of refinement. Regards, MC"
  33. 33. 33 This email is to inform you that user xx's application for access to restricted material in the deposit kunbarlang-389 has just been approved. The depositor included the following note to the user: "Hi xx I've approved your access to this collection, but you should know that there is an update in the material I've just deposited, with much more information on both music and texts. I'd be happy to give you access to that when it is processed. Next time I come to London (October or November this year) I'd be happy to meet up if you would like to discuss."
  34. 34. 34 User xx has just applied for access to restricted material in the deposit cappadocian-375. The following message was attached to the application: "Dear [depositor], I work as a research assistant in Nevsehir University in Cappadocia, Turkey. As you know, Cappadocian language has some relics in this region despite speakers of Cappadocian do not live anymore. In my university, there are few research on this subject with collaboration of Greek friends and local societies … I would like to access to your material … By the way, i would like to interview with you about Cappadocian language for our international journal of art and language. I hope you will have time for our journal . Thank you in advance."
  35. 35. 35 This email is to inform you that user xx's application for access to restricted material in the deposit johnston2012auslan has just been approved. The depositor included the following note to the user: "I am giving you user access which means you should be able to see the ELAN eaf annotation files for the topics "The boy who cried wolf" and for "The hare and the tortoise". You should also be able to see most other movies except those tagged "1a" "4a" and "5". If you cannot see the ELAN eaf annotations I hope the problem will be fixed soon. I told the ELAR team about this."
  36. 36. 36 Applied documentation • Should documentation contribute to sustaining language and cultural diversity and the communities who want to maintain and develop them? • What would documentary linguistics look like if it took revitalisation (and pedagogy) as its primary goal – e.g. types of data, learner-directed language, sequencing? See Nathan & Fang 2013 • Are there mismatches between linguists’ ideologies of endangered languages and documentation and community ideologies? See Austin & Sallabank 2014
  37. 37. 37 Examples • emergence of examples of applied language documentation and language and cultural revitalisation, eg. papers in LDD 11, Wuqu’ Kawoq (from Guatemala), Maori (from New Zealand) • this year I have been involved in a project with the Dieri Aboriginal Corporation in Australia aimed at cultural and linguistic repatriation and revival which has taught me a lot about links between primary documentation and its applications
  38. 38. 38 “… it seems that in general many documenters are struggling with formal aspects of their documentary work because of a late recognition by leaders in documentary linguistics that a good language documentation might be very much more than a set of dozens, hundreds, or thousands of files in archiveable formats.” (Nathan 2012)
  39. 39. 39 Conclusions • we need to move beyond 20th century models of language documentation and archiving and become more reflexive and analytical about our goals, practices, methods and values • we need to bring more of the social aspect of human life into language documentation and linguistic research (where it has been largely missing for the past 20 years of renewed interest in endangered languages) replacing objectification and commodification with concern for what is special and unique about the contexts, and the people, cultures and languages we are attempting to document and support
  40. 40. 40 Thank you!