Poio API - An annotation framework to bridgeLanguage Documentation and Natural Language                  Processing      C...
Language documentation●   Aim of developing a "lasting, multipurpose record of a    language"●   Collection, distribution,...
Natural Language Processing●   Any kind of computer manipulation of natural language●   Mostly for „major“ languages like ...
Quantitative Language Comparison●   In contrast to „corpus linguistics“ (see Michael    Cysouws research group)●   Based o...
Annotation Graphs, LAF and GrAF●   ISO standard 24612 "Language resource    management - Linguistic annotation framework (...
Poio API●   Part of Clarin-D curation project at the University of    Cologne●   Connectors to „The Language Archive“ and ...
Poio APICentro Interdisciplinar de Documentação Linguística e Social, http://www.cidles.eu                          @ ACRH...
Data Structure Types (1/2)●   List of lists, tree structure    ●   [ ’utterance’,          [’word’, ’wfw’],        ’transl...
Data Structure Types (2/2)●   Objective    ●   Mapping the tree structures into GrAF structure●   Advantages    ●   Flexib...
Annotation TreeCentro Interdisciplinar de Documentação Linguística e Social, http://www.cidles.eu                         ...
Graf-python (1/3)●   Python implementation of GrAF    ●   Developed by Stephen Matysik for ANC●   Provides the underlying ...
Graf-python (2/3)●   Example: accessing the nodes in a „graid1“ tierBlock Code:                                           ...
Graf-python (3/3)                                                          Utterance 1                                    ...
The future: Usage of graphs●   Graph-coloring algorithm to provide insight on LD data    ●   make common subgraphs visible...
Thank you for your attention!Centro Interdisciplinar de Documentação Linguística e Social                        Minde/Por...
Links●   Poio (API): http://media.cidles.eu/poio/●   ISO 24612:    http://www.iso.org/iso/catalogue_detail.htm?csnumb●   T...
Upcoming SlideShare
Loading in …5
×

Poio API - An annotation framework to bridge Language Documentation and Natural Language Processing

472 views

Published on

After 20 years of multimedia data collection from endangered languages and consequent creation of extensive corpora with large amounts of annotated linguistic data, a new trend in Language Documentation is now observable. It can be described as a shift from data collection and qualitative language analysis to quantitative language comparison based on the data previously collected. However, the heterogeneous annotation types and formats in the corpora hinder the application of new developed computational methods in their analysis. A standardized representation is needed. Poio API, a scientific software library written in Python and based on Linguistic Annotation Framework, fulfills this need and establishes the bridge between Language Documentation and Natural Language Processing (NLP). Hence, it represents an innovative approach which will open up new options in interdisciplinary collaborative linguistic research. This paper offers a contextualization of Poio API in the framework of current linguistic and NLP research as well as a description of its development.

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
472
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
4
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Poio API - An annotation framework to bridge Language Documentation and Natural Language Processing

  1. 1. Poio API - An annotation framework to bridgeLanguage Documentation and Natural Language Processing Centro Interdisciplinar de Documentação Linguística e Social Minde/Portugal Vera Ferreira, vferreira@cidles.eu Peter Bouda, pbouda@cidles.eu António Lopes, alopes@cidles.eu Centro Interdisciplinar de Documentação Linguística e Social, http://www.cidles.eu @ ACRH-2, Lisbon, 29.11.2012
  2. 2. Language documentation● Aim of developing a "lasting, multipurpose record of a language"● Collection, distribution, and preservation of primary data of a variety of communicative events● Data is normally transcribed, translated, and it should also be annotated● Archives to preserve and publish documentation ● The Language Archive ● Endangered Languages Archive (ELAR) Centro Interdisciplinar de Documentação Linguística e Social, http://www.cidles.eu @ ACRH-2, Lisbon, 29.11.2012
  3. 3. Natural Language Processing● Any kind of computer manipulation of natural language● Mostly for „major“ languages like English, Spanish, German, etc.● NLP is rarely used on LD data● Archiving needs led to digitization● Now we see „corpus-based XYZ“ in General Linguistics● Indiviual examples are hand-picked● (Semi-)automated tagging of lesser-known languages Centro Interdisciplinar de Documentação Linguística e Social, http://www.cidles.eu @ ACRH-2, Lisbon, 29.11.2012
  4. 4. Quantitative Language Comparison● In contrast to „corpus linguistics“ (see Michael Cysouws research group)● Based on LD data, bible texts, movie subtitles etc.● Supports typological research Centro Interdisciplinar de Documentação Linguística e Social, http://www.cidles.eu @ ACRH-2, Lisbon, 29.11.2012
  5. 5. Annotation Graphs, LAF and GrAF● ISO standard 24612 "Language resource management - Linguistic annotation framework (LAF)“● Annotation graphs as the underlying data model for linguistic annotations● Developed for MASC of the American National Corpus● Existing connectors for UIMA and GATE● Radical stand-off approach ● Unsupervised collaboration Centro Interdisciplinar de Documentação Linguística e Social, http://www.cidles.eu @ ACRH-2, Lisbon, 29.11.2012
  6. 6. Poio API● Part of Clarin-D curation project at the University of Cologne● Connectors to „The Language Archive“ and Clarin Weblicht● Layered architecture ● API ● Internal representation (LAF) ● File format plugins (EAF, Toolbox, TCF)● Based on PyAnnotation and graf-python Centro Interdisciplinar de Documentação Linguística e Social, http://www.cidles.eu @ ACRH-2, Lisbon, 29.11.2012
  7. 7. Poio APICentro Interdisciplinar de Documentação Linguística e Social, http://www.cidles.eu @ ACRH-2, Lisbon, 29.11.2012
  8. 8. Data Structure Types (1/2)● List of lists, tree structure ● [ ’utterance’, [’word’, ’wfw’], ’translation’ ]● For example GRAID (Grammatical Relations and Animacy in Discourse) ● [ ’utterance’, [’clause unit’, [ ’word’, ’wfw’, ’graid1’], ’graid2’], ’translation’ ] Centro Interdisciplinar de Documentação Linguística e Social, http://www.cidles.eu @ ACRH-2, Lisbon, 29.11.2012
  9. 9. Data Structure Types (2/2)● Objective ● Mapping the tree structures into GrAF structure● Advantages ● Flexibility in the construction of annotation hierarchies ● Automatic transformation of the tree structures into a user interface (Poio Editor and Analyzer) ● Customization and colloboration● Disadvantages ● Not all annotation schemes can be mapped onto a tree-like structure Centro Interdisciplinar de Documentação Linguística e Social, http://www.cidles.eu @ ACRH-2, Lisbon, 29.11.2012
  10. 10. Annotation TreeCentro Interdisciplinar de Documentação Linguística e Social, http://www.cidles.eu @ ACRH-2, Lisbon, 29.11.2012
  11. 11. Graf-python (1/3)● Python implementation of GrAF ● Developed by Stephen Matysik for ANC● Provides the underlying data structure for all data and annotations that Poio API can manage (interoperability) ● Accessing the nodes, edges, regions and their annotations from the parsed files (GrAF ISO) Centro Interdisciplinar de Documentação Linguística e Social, http://www.cidles.eu @ ACRH-2, Lisbon, 29.11.2012
  12. 12. Graf-python (2/3)● Example: accessing the nodes in a „graid1“ tierBlock Code: Result:gparser = GraphParser() NodeID = word-n1file = example-graid1.xml Annotation(word, a-112)file_stream = codecs.open(file, r, utf-8) Annotation(graid1, a-508)g = gparser.parse(file_stream) compfor node in g.nodes: NodeID = word-n2 print(node) Annotation(word, a-113) for annotation in node.annotations: Annotation(graid1, a-509) print(annotation) deti graid1 = annotation.features.get(graid1) NodeID = word-n3 if graid1 is not None: Annotation(word, a-114) print(graid1) Annotation(graid1, a-510) np.h:s=cop:predp Centro Interdisciplinar de Documentação Linguística e Social, http://www.cidles.eu @ ACRH-2, Lisbon, 29.11.2012
  13. 13. Graf-python (3/3) Utterance 1 Region [0-20] Word-n1 Word-n2 Region [0 2] Region [3 7] „ki“ „comp“ „yag“ „deti“ word graid1 word graid1Centro Interdisciplinar de Documentação Linguística e Social, http://www.cidles.eu @ ACRH-2, Lisbon, 29.11.2012
  14. 14. The future: Usage of graphs● Graph-coloring algorithm to provide insight on LD data ● make common subgraphs visible after merge of corpora● Graph-traversal algorithms to collect statistical data ● Clusters of annotation values● Weighted graphs to reflect links between sources ● Quantitative Historical Linguistics with dictionaries ● Linked via spanish translations Centro Interdisciplinar de Documentação Linguística e Social, http://www.cidles.eu @ ACRH-2, Lisbon, 29.11.2012
  15. 15. Thank you for your attention!Centro Interdisciplinar de Documentação Linguística e Social Minde/Portugal Vera Ferreira, vferreira@cidles.eu Peter Bouda, pbouda@cidles.eu António Lopes, alopes@cidles.euCentro Interdisciplinar de Documentação Linguística e Social, http://www.cidles.eu @ ACRH-2, Lisbon, 29.11.2012
  16. 16. Links● Poio (API): http://media.cidles.eu/poio/● ISO 24612: http://www.iso.org/iso/catalogue_detail.htm?csnumb● The Language Archive:http://tla.mpi.nl/● Weblicht: http://weblicht.sfs.uni-tuebingen.de/index.shtml Centro Interdisciplinar de Documentação Linguística e Social, http://www.cidles.eu @ ACRH-2, Lisbon, 29.11.2012

×