Poio API - An annotation framework to bridge Language Documentation and Natural Language Processing


After 20 years of multimedia data collection from endangered languages and the consequent creation of extensive corpora with large amounts of annotated linguistic data, a new trend in Language Documentation is now observable. It can be described as a shift from data collection and qualitative language analysis to quantitative language comparison based on the data previously collected. However, the heterogeneous annotation types and formats in the corpora hinder the application of newly developed computational methods to their analysis. A standardized representation is needed. Poio API, a scientific software library written in Python and based on the Linguistic Annotation Framework (LAF), fulfills this need and establishes the bridge between Language Documentation and Natural Language Processing (NLP). Hence, it represents an innovative approach which will open up new options in interdisciplinary collaborative linguistic research. This paper offers a contextualization of Poio API in the framework of current linguistic and NLP research as well as a description of its development.


Transcript

  • 1. Poio API - An annotation framework to bridge Language Documentation and Natural Language Processing
    Vera Ferreira, vferreira@cidles.eu
    Peter Bouda, pbouda@cidles.eu
    António Lopes, alopes@cidles.eu
    Centro Interdisciplinar de Documentação Linguística e Social, Minde/Portugal, http://www.cidles.eu
    ACRH-2, Lisbon, 29.11.2012
  • 2. Language documentation
    ● Aim of developing a "lasting, multipurpose record of a language"
    ● Collection, distribution, and preservation of primary data of a variety of communicative events
    ● Data is normally transcribed and translated, and it should also be annotated
    ● Archives to preserve and publish documentation
      ● The Language Archive
      ● Endangered Languages Archive (ELAR)
  • 3. Natural Language Processing
    ● Any kind of computer manipulation of natural language
    ● Mostly for "major" languages like English, Spanish, German, etc.
    ● NLP is rarely used on LD data
    ● Archiving needs led to digitization
    ● Now we see "corpus-based XYZ" in General Linguistics
    ● Individual examples are hand-picked
    ● (Semi-)automated tagging of lesser-known languages
  • 4. Quantitative Language Comparison
    ● In contrast to "corpus linguistics" (see Michael Cysouw's research group)
    ● Based on LD data, Bible texts, movie subtitles, etc.
    ● Supports typological research
  • 5. Annotation Graphs, LAF and GrAF
    ● ISO standard 24612 "Language resource management - Linguistic annotation framework (LAF)"
    ● Annotation graphs as the underlying data model for linguistic annotations
    ● Developed for MASC of the American National Corpus
    ● Existing connectors for UIMA and GATE
    ● Radical stand-off approach
      ● Unsupervised collaboration
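As a rough illustration of the stand-off idea on this slide, the sketch below leaves the primary text untouched and stores annotations separately, pointing into the text by character offsets. The class names, the sample text and the label values are all hypothetical; this is plain Python, not the GrAF format or the graf-python API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    """A span of the primary text as character offsets [start, end)."""
    start: int
    end: int


@dataclass
class Node:
    """A graph node anchored to a region and carrying labelled annotations."""
    node_id: str
    region: Region
    annotations: dict = field(default_factory=dict)


def covered_text(primary_text: str, node: Node) -> str:
    """Resolve a node back to the stretch of primary text it points at."""
    return primary_text[node.region.start:node.region.end]


# Hypothetical primary data; annotating never modifies this string.
primary_text = "the cat sleeps"

# Hypothetical annotation layer, stored apart from the text.
nodes = [
    Node("word-n1", Region(0, 3), {"pos": "DET"}),
    Node("word-n2", Region(4, 7), {"pos": "NOUN"}),
    Node("word-n3", Region(8, 14), {"pos": "VERB"}),
]

for node in nodes:
    print(node.node_id, covered_text(primary_text, node), node.annotations)
```

Because every layer only refers to offsets, a second annotator could add another layer over the same text without touching the first one, which is one way to read "unsupervised collaboration".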
  • 6. Poio API
    ● Part of Clarin-D curation project at the University of Cologne
    ● Connectors to "The Language Archive" and Clarin Weblicht
    ● Layered architecture
      ● API
      ● Internal representation (LAF)
      ● File format plugins (EAF, Toolbox, TCF)
    ● Based on PyAnnotation and graf-python
  • 7. Poio API
  • 8. Data Structure Types (1/2)
    ● List of lists, tree structure
      ● [ 'utterance', ['word', 'wfw'], 'translation' ]
    ● For example GRAID (Grammatical Relations and Animacy in Discourse)
      ● [ 'utterance', ['clause unit', ['word', 'wfw', 'graid1'], 'graid2'], 'translation' ]
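The nested lists on this slide can be read as a tree of tiers. The sketch below walks such a list and prints the tiers with indentation. It assumes one particular reading: the first element of each list is the parent tier and the remaining elements are its children, with a nested list forming a subtree. The function name `walk` is hypothetical and not part of Poio API.

```python
def walk(hierarchy, depth=0):
    """Yield (tier_name, depth) pairs in the order the tiers are listed.

    Assumed reading: the first element of a list is the parent tier, the
    remaining elements are its children, and a nested list is a subtree.
    """
    parent, *children = hierarchy
    yield parent, depth
    for child in children:
        if isinstance(child, list):
            yield from walk(child, depth + 1)
        else:
            yield child, depth + 1


# The GRAID hierarchy from the slide, written as a Python list.
GRAID = ["utterance",
         ["clause unit", ["word", "wfw", "graid1"], "graid2"],
         "translation"]

for tier, depth in walk(GRAID):
    print("  " * depth + tier)
```

Under this reading, "wfw" and "graid1" both hang off "word", while "graid2" is attached to the clause unit.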
  • 9. Data Structure Types (2/2)
    ● Objective
      ● Mapping the tree structures into GrAF structure
    ● Advantages
      ● Flexibility in the construction of annotation hierarchies
      ● Automatic transformation of the tree structures into a user interface (Poio Editor and Analyzer)
      ● Customization and collaboration
    ● Disadvantages
      ● Not all annotation schemes can be mapped onto a tree-like structure
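One way to picture the mapping objective is to turn a tier hierarchy into explicit parent-to-child links, which is the kind of information a graph representation needs. This is only a sketch under stated assumptions: the first element of each list is taken to be the parent tier, `tier_edges` is a hypothetical helper, and the output is a plain list of pairs rather than actual GrAF nodes and edges.

```python
def tier_edges(hierarchy):
    """Return (parent_tier, child_tier) pairs for a nested tier hierarchy.

    Assumed reading: the first element of a list is the parent tier and
    the remaining elements (strings or nested lists) are its children.
    """
    parent, *children = hierarchy
    edges = []
    for child in children:
        if isinstance(child, list):
            # The subtree's own first element is the direct child tier.
            edges.append((parent, child[0]))
            edges.extend(tier_edges(child))
        else:
            edges.append((parent, child))
    return edges


# The basic hierarchy from the previous slide.
BASIC = ["utterance", ["word", "wfw"], "translation"]

for parent, child in tier_edges(BASIC):
    print(parent, "->", child)
```

In this representation every tier has exactly one parent, which also makes the stated disadvantage concrete: a scheme in which one tier depends on two different parent tiers cannot be written as such a nested list.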
  • 10. Annotation Tree
  • 11. Graf-python (1/3)
    ● Python implementation of GrAF
      ● Developed by Stephen Matysik for ANC
    ● Provides the underlying data structure for all data and annotations that Poio API can manage (interoperability)
      ● Accessing the nodes, edges, regions and their annotations from the parsed files (GrAF ISO)
  • 12. Graf-python (2/3)
    ● Example: accessing the nodes in a "graid1" tier

    Code:

        import codecs
        from graf import GraphParser

        gparser = GraphParser()
        file = 'example-graid1.xml'
        file_stream = codecs.open(file, 'r', 'utf-8')
        g = gparser.parse(file_stream)

        # Print every node, its annotations and, where present, the graid1 value
        for node in g.nodes:
            print(node)
            for annotation in node.annotations:
                print(annotation)
                graid1 = annotation.features.get('graid1')
                if graid1 is not None:
                    print(graid1)

    Result:

        NodeID = word-n1
        Annotation(word, a-112)
        Annotation(graid1, a-508)
        comp
        NodeID = word-n2
        Annotation(word, a-113)
        Annotation(graid1, a-509)
        deti
        NodeID = word-n3
        Annotation(word, a-114)
        Annotation(graid1, a-510)
        np.h:s=cop:predp
  • 13. Graf-python (3/3)
    [Diagram: graph structure of one utterance]
    ● Utterance 1, Region [0-20]
      ● Word-n1, Region [0 2]: word "ki", graid1 "comp"
      ● Word-n2, Region [3 7]: word "yag", graid1 "deti"
  • 14. The future: Usage of graphs
    ● Graph-coloring algorithm to provide insight on LD data
      ● Make common subgraphs visible after merge of corpora
    ● Graph-traversal algorithms to collect statistical data
      ● Clusters of annotation values
    ● Weighted graphs to reflect links between sources
      ● Quantitative Historical Linguistics with dictionaries
      ● Linked via Spanish translations
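The last point, linking dictionaries through shared Spanish translations, could be sketched roughly as follows. Everything here is an illustrative assumption: the two dictionaries, their placeholder headwords and the choice of "number of shared translations" as the edge weight are invented for the example and are not taken from the authors' data or from Poio API.

```python
from itertools import product

# Hypothetical dictionaries: headword -> set of Spanish translations.
dict_a = {"a-water": {"agua"}, "a-house": {"casa", "hogar"}}
dict_b = {"b-home": {"casa", "hogar"}, "b-river": {"río", "agua"}}


def link_entries(left, right):
    """Build weighted edges between entries of two dictionaries.

    Two headwords are linked if they share at least one Spanish
    translation; the weight is the number of shared translations
    (an assumed weighting chosen for this sketch).
    """
    edges = {}
    for (lword, ltrans), (rword, rtrans) in product(left.items(), right.items()):
        shared = ltrans & rtrans
        if shared:
            edges[(lword, rword)] = len(shared)
    return edges


for (source, target), weight in sorted(link_entries(dict_a, dict_b).items()):
    print(source, "--", weight, "--", target)
```

Heavier edges mark pairs of entries that agree on more translations, which is one simple signal a quantitative comparison could start from.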
  • 15. Thank you for your attention!
    Vera Ferreira, vferreira@cidles.eu
    Peter Bouda, pbouda@cidles.eu
    António Lopes, alopes@cidles.eu
  • 16. Links
    ● Poio (API): http://media.cidles.eu/poio/
    ● ISO 24612: http://www.iso.org/iso/catalogue_detail.htm?csnumb
    ● The Language Archive: http://tla.mpi.nl/
    ● Weblicht: http://weblicht.sfs.uni-tuebingen.de/index.shtml