Poio API and GrAF-XML @ Balisage 2013

Language documentation projects all over the world have accumulated a large and heterogeneous corpus of linguistic material. Because of this diversity, accessing and analyzing the components is difficult, particularly for multimedia material. The "Graph Annotation Framework" (GrAF), a stand-off annotation method, is applied to utterance examples in time-aligned annotations of video samples. An easy-to-use programming interface defined in the Poio API, a project within the CLARIN framework ("Common Language Resources and Technology Infrastructure"), then greatly simplifies access, because users no longer need to deal with the multiple input formats of the source material. GrAF-XML provides a basis for exchanging results among the various projects that analyze the corpus.

Published in: Technology, Education

  1. 1. Poio API and GrAF-XML: A radical stand-off approach in language documentation and language typology
     Jonathan Blumtritt, Cologne Center for eHumanities, University of Cologne
     Peter Bouda, Centro Interdisciplinar de Documentação Linguística e Social
     Felix Rau, Department of Linguistics, University of Cologne
  2. 2. Overview
     ● Existing infrastructure and workflows
     ● CLARIN
     ● Annotation graphs
     ● GrAF and Poio API
     ● Example: Elan EAF to GrAF-XML
     ● CLASS
  3. 3. Fieldwork photos
  4. 4. Existing Infrastructure
  5. 5. LD tools and standards
     ● Elan: EAF, MPEG, WAV
     ● Toolbox: TXT, XML, WAV
     ● Arbil: IMDI/CMDI ("Component MetaData Infrastructure")
     ● Praat: XML, WAV
     ● ...
     ● No standards for tier hierarchies, tier names or annotation schemes
     ● Efforts in ISOcat
  6. 6. CLARIN
     ● European initiative within the European Research Infrastructure Consortium: Common Language Resources and Technology Infrastructure (CLARIN)
     ● Aims to provide scholars in the humanities and social sciences with easy and sustainable access to digital language data
     ● Started in 2006, part of a roadmap process, timeline currently ending 2020
     ● CLARIN-D: working groups in Germany
     ● Curation projects for different research areas in linguistics
  7. 7. Annotation Graphs
     ● the underlying data model for linguistic annotations
     ● pivot structure for linguistic data
     ● time vs. byte offsets
     ● not hierarchical (but trees are also graphs)
     ● stand-off annotation
     ● "It is important to recognize that translation into AGs does not magically create compatibility among systems whose semantics are different." [Bird & Liberman 2001]
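
The annotation graph idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `Arc`, the function `arcs_covering`, the node ids, the offsets and the example words are all invented here and come neither from the slides nor from any library. Nodes anchor into the primary data (time offsets in this case), and labelled arcs between nodes carry the annotations, so the primary data itself is never modified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    source: str  # id of the node where the labelled stretch begins
    target: str  # id of the node where it ends
    tier: str    # kind of annotation carried by the arc
    value: str   # the annotation itself


# Nodes are anchors into the primary data: time offsets in milliseconds here,
# byte or character offsets for a text. All values are invented.
NODES = {"n0": 0, "n1": 780, "n2": 1340, "n3": 2100}

# Arcs from different tiers may overlap freely; no tree shape is required.
ARCS = [
    Arc("n0", "n3", "utterance", "so we go"),
    Arc("n1", "n2", "words", "so"),
    Arc("n2", "n3", "words", "we go"),
]


def arcs_covering(offset):
    """Return all arcs whose span contains the given offset."""
    return [a for a in ARCS if NODES[a.source] <= offset < NODES[a.target]]
```

Calling `arcs_covering(1000)` returns the utterance arc and the arc for "so", because both spans contain that offset.
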
  8. 8. AGs visualized
  9. 9. GrAF
     ● GrAF: Graph Annotation Framework
     ● ISO 24612: Language resource management - Linguistic annotation framework (LAF)
     ● Started as stand-off version of XCES
     ● API and representation as data structures, not a file format
     ● GrAF/XML as XML representation
     ● Used for the Manually Annotated Sub-Corpus (MASC) of the American National Corpus (ANC)
     ● Nodes, edges, regions, annotations, feature structures
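
To make the point "data structures, not a file format" concrete, the five entities named on this slide could be modelled roughly as follows. This is a hypothetical sketch, not the API of the existing GrAF implementations in Java or Python: the class names, field names and the helper methods `span_of` and `features_of` are assumptions chosen for illustration, and feature structures are simplified to flat dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    id: str
    anchors: tuple  # positions in the primary data, e.g. (start, end)


@dataclass
class Node:
    id: str
    region_ids: list = field(default_factory=list)  # links to regions


@dataclass
class Edge:
    id: str
    from_node: str
    to_node: str


@dataclass
class Annotation:
    id: str
    node_id: str
    label: str
    features: dict = field(default_factory=dict)  # flat feature structure


@dataclass
class Graph:
    regions: dict = field(default_factory=dict)
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)
    annotations: list = field(default_factory=list)

    def span_of(self, node_id):
        """Return the anchors of every region the node links to."""
        return [self.regions[r].anchors for r in self.nodes[node_id].region_ids]

    def features_of(self, node_id):
        """Merge the features of all annotations attached to the node."""
        merged = {}
        for annotation in self.annotations:
            if annotation.node_id == node_id:
                merged.update(annotation.features)
        return merged


# A two-node graph with invented ids: n0 is the parent of n1.
graph = Graph()
graph.regions["r1"] = Region("r1", (780, 1340))
graph.nodes["n0"] = Node("n0")
graph.nodes["n1"] = Node("n1", ["r1"])
graph.edges.append(Edge("e1", "n0", "n1"))
graph.annotations.append(Annotation("a1", "n1", "words", {"annotation_value": "so"}))
```

Keeping regions separate from nodes means that several nodes can point at the same stretch of primary data, and a node without a region (like `n0` above) is still valid.
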
  10. 10. TEI and GrAF
      ● Schemata for GrAF created with TEI Roma
      ● Customized version of TEI P5 schema
      ● ODD: "One Document Does it all"
      ● GrAF is not TEI compliant
      ● Share data types and feature structures of annotations
      ● TEI has "stand-off" variant, uses XPointer/XLink
        – Primary data has to be XML
  11. 11. Why we use GrAF
      ● Because it's new! :-)
      ● No inline markup
      ● Radical stand-off approach
        – Easier to share and manage data
        – Preferred solution to archive cultural heritage
        – Ideal for sparse annotations
      ● Existing code: Java and Python
      ● The beauty of annotation graphs
  12. 12. Poio API
      ● Think of GrAF as an assembly language for linguistic annotation; then Poio API is a library to map from and to higher-level languages
      ● Subset of GrAF to represent tier-based annotation
      ● Filters and filter chains for search
      ● Plugin mechanism for file formats
        – Mapping semantics: tiers and annotations to nodes and edges
      ● Meta-data for additional information (tier types etc.)
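
The notion of "filters and filter chains" can be illustrated with a generic sketch. It does not use Poio API's own filter classes: the functions `tier_contains` and `apply_chain`, the tier names and the toy utterances are all hypothetical and serve only to show how a chain narrows a search step by step.

```python
from functools import reduce

# Toy tier-based data: one dict per utterance, keyed by (invented) tier names.
UTTERANCES = [
    {"words": ["so", "we", "go"], "pos": ["ADV", "PRON", "V"]},
    {"words": ["they", "go"], "pos": ["PRON", "V"]},
    {"words": ["so", "what"], "pos": ["ADV", "PRON"]},
]


def tier_contains(tier, value):
    """Build a filter that keeps utterances whose tier contains the value."""
    def keep(utterance):
        return value in utterance.get(tier, [])
    return keep


def apply_chain(filters, utterances):
    """Apply the filters in order; each one narrows the previous result."""
    return reduce(
        lambda rows, keep: [row for row in rows if keep(row)],
        filters,
        utterances,
    )


# Utterances that contain the word "so" and also contain a verb.
chain = [tier_contains("words", "so"), tier_contains("pos", "V")]
hits = apply_chain(chain, UTTERANCES)
```

With the toy data above, only the first utterance passes both filters.
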
  13. 13. Example: Mapping of EAF to GrAF-XML
  14. 14. Elan EAF
      <TIER DEFAULT_LOCALE="en" LINGUISTIC_TYPE_REF="words" PARENT_REF="W-Spch" PARTICIPANT="" TIER_ID="W-Words">
        <ANNOTATION>
          <ALIGNABLE_ANNOTATION ANNOTATION_ID="a23" TIME_SLOT_REF1="ts4" TIME_SLOT_REF2="ts6">
            <ANNOTATION_VALUE>so</ANNOTATION_VALUE>
          </ALIGNABLE_ANNOTATION>
        </ANNOTATION>
        <ANNOTATION>
          [...]
        </ANNOTATION>
      </TIER>
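
As a rough illustration of what a format plugin has to extract from such a tier, the fragment can be read with Python's standard `xml.etree.ElementTree`. This is a sketch, not Poio API's parser: the function name `read_tier` and the dictionary keys are assumptions, and the elided second annotation from the slide is left out so that the fragment is well-formed.

```python
import xml.etree.ElementTree as ET

# The tier from the slide, without the elided second annotation.
EAF_TIER = """
<TIER DEFAULT_LOCALE="en" LINGUISTIC_TYPE_REF="words" PARENT_REF="W-Spch"
      PARTICIPANT="" TIER_ID="W-Words">
  <ANNOTATION>
    <ALIGNABLE_ANNOTATION ANNOTATION_ID="a23" TIME_SLOT_REF1="ts4"
                          TIME_SLOT_REF2="ts6">
      <ANNOTATION_VALUE>so</ANNOTATION_VALUE>
    </ALIGNABLE_ANNOTATION>
  </ANNOTATION>
</TIER>
"""


def read_tier(xml_text):
    """Return the tier's metadata and its time-aligned annotations."""
    tier = ET.fromstring(xml_text)
    info = {
        "tier_id": tier.get("TIER_ID"),
        "type": tier.get("LINGUISTIC_TYPE_REF"),
        "parent": tier.get("PARENT_REF"),
    }
    annotations = [
        {
            "id": a.get("ANNOTATION_ID"),
            "start_slot": a.get("TIME_SLOT_REF1"),
            "end_slot": a.get("TIME_SLOT_REF2"),
            "value": a.findtext("ANNOTATION_VALUE"),
        }
        for a in tier.iter("ALIGNABLE_ANNOTATION")
    ]
    return info, annotations


info, annotations = read_tier(EAF_TIER)
```

The time slot references (`ts4`, `ts6`) are only ids; in a complete EAF file they still have to be resolved against the file's time slot list to obtain actual time values.
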
  15. 15. GrAF entities
  16. 16. GrAF structure
  17. 17. GrAF-XML
      <node xml:id="words..W-Words..na23">
        <link targets="words..W-Words..ra23"/>
      </node>
      <region anchors="780 1340" xml:id="words..W-Words..ra23"/>
      <edge from="utterance..W-Spch..n8" to="words..W-Words..na23" xml:id="ea23"/>
      <a as="words" label="words" ref="words..W-Words..na23" xml:id="a23">
        <fs>
          <f name="annotation_value">so</f>
        </fs>
      </a>
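
The stand-off links in this fragment can be followed with a short script. The sketch below makes several assumptions: the elements are wrapped in a plain `<graph>` root without the GrAF namespace so that the fragment parses on its own, and the function `resolve_nodes` and its result keys are invented for illustration. It joins each node with the anchors of its region, its parent node (via the edge) and its annotation value.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The slide's elements inside an assumed, namespace-free wrapper element.
GRAF_FRAGMENT = """
<graph>
  <node xml:id="words..W-Words..na23">
    <link targets="words..W-Words..ra23"/>
  </node>
  <region anchors="780 1340" xml:id="words..W-Words..ra23"/>
  <edge from="utterance..W-Spch..n8" to="words..W-Words..na23" xml:id="ea23"/>
  <a as="words" label="words" ref="words..W-Words..na23" xml:id="a23">
    <fs>
      <f name="annotation_value">so</f>
    </fs>
  </a>
</graph>
"""


def resolve_nodes(xml_text):
    """Join every node with its region anchors, parent node and value."""
    root = ET.fromstring(xml_text)
    regions = {
        region.get(XML_ID): tuple(int(a) for a in region.get("anchors").split())
        for region in root.iter("region")
    }
    parents = {edge.get("to"): edge.get("from") for edge in root.iter("edge")}
    values = {
        a.get("ref"): a.findtext("fs/f[@name='annotation_value']")
        for a in root.iter("a")
    }
    resolved = {}
    for node in root.iter("node"):
        node_id = node.get(XML_ID)
        targets = [
            target
            for link in node.iter("link")
            for target in link.get("targets").split()
        ]
        resolved[node_id] = {
            "anchors": [regions[target] for target in targets],
            "parent": parents.get(node_id),
            "value": values.get(node_id),
        }
    return resolved


resolved = resolve_nodes(GRAF_FRAGMENT)
```

Note how the identifiers encode the origin of each element: the prefix `words..W-Words` combines the linguistic type and the tier id of the EAF tier, and the suffix reuses the EAF annotation id `a23`.
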
  18. 18. Tier hierarchies
      [
        ['utterance..K-Spch'],
        ['utterance..W-Spch',
          ['words..W-Words',
            ['part_of_speech..W-POS']
          ],
          ['phonetic_transcription..W-IPA']
        ],
        ['gestures..W-RGU',
          ['gesture_phases..W-RGph',
            ['gesture_meaning..W-RGMe']
          ]
        ],
        ['gestures..K-RGU',
          ['gesture_phases..K-RGph',
            ['gesture_meaning..K-RGMe']
          ]
        ]
      ]
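
A nested list like this is easy to traverse recursively. The following sketch assumes, as the layout suggests, that the first element of each list is a tier name and all further elements are child tiers; the function names `walk` and `outline` are hypothetical helpers, not part of Poio API.

```python
TIER_HIERARCHY = [
    ['utterance..K-Spch'],
    ['utterance..W-Spch',
        ['words..W-Words',
            ['part_of_speech..W-POS']],
        ['phonetic_transcription..W-IPA']],
    ['gestures..W-RGU',
        ['gesture_phases..W-RGph',
            ['gesture_meaning..W-RGMe']]],
    ['gestures..K-RGU',
        ['gesture_phases..K-RGph',
            ['gesture_meaning..K-RGMe']]],
]


def walk(hierarchy, parent=None, depth=0):
    """Yield (tier name, parent tier name, depth) in document order."""
    for entry in hierarchy:
        name, children = entry[0], entry[1:]
        yield name, parent, depth
        yield from walk(children, name, depth + 1)


def outline(hierarchy):
    """Render the hierarchy as an indented outline, one tier per line."""
    return "\n".join("  " * depth + name for name, _, depth in walk(hierarchy))


rows = list(walk(TIER_HIERARCHY))
```

Each tier name can be split on `..` to recover the linguistic type and the tier id, for example `'words..W-Words'.split('..')` gives `['words', 'W-Words']`.
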
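  19. 19. The code
      # Convert an Elan EAF file to GrAF-XML with Poio API
      import poioapi.annotationgraph
      import poioapi.io.graf
      ag = poioapi.annotationgraph.AnnotationGraph()
      # The parser reads the EAF input, the writer produces GrAF-XML
      parser = poioapi.io.ElanParser("example.eaf")
      writer = poioapi.io.graf.Writer()
      converter = poioapi.io.graf.GrAFConverter(parser, writer)
      converter.parse()
      # Write the GrAF header file
      converter.write("example.hdr")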
  20. 20. Analysis workflows
      ● Graph-based methods
      ● Pipe to scientific Python libraries
      ● GrAF connectors for major linguistic workflow tools (GATE and Apache UIMA)
      ● Example: Polysemy in dictionaries
      ● Example: Counting word orders
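
The slides do not show how the word-order example was implemented. As a minimal sketch of the kind of analysis meant, assume each clause has already been extracted from the annotation graph as a list of (word, role) pairs with the roles S, V and O; the data, the role labels and the function names below are invented for illustration.

```python
from collections import Counter

# Invented toy clauses; in practice these would come from annotated tiers.
CLAUSES = [
    [("she", "S"), ("reads", "V"), ("books", "O")],
    [("books", "O"), ("she", "S"), ("reads", "V")],
    [("he", "S"), ("sees", "V"), ("them", "O")],
    [("rain", "S"), ("falls", "V")],
]

CORE_ROLES = {"S", "V", "O"}


def word_order(clause):
    """Return the order of core constituents in a clause, e.g. 'SVO'."""
    return "".join(role for _, role in clause if role in CORE_ROLES)


def count_word_orders(clauses):
    """Count how often each constituent order occurs."""
    return Counter(word_order(clause) for clause in clauses)


counts = count_word_orders(CLAUSES)
```

A `Counter` like this can be handed directly to plotting or statistics code, which is the sense in which results are piped to scientific Python libraries.
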
  21. 21. CLASS
  22. 22. Thank you for your attention! pbouda@cidles.eu
  23. 23. Links
      CLARIN curation project: http://de.clarin.eu/en/discipline-specific-working-groups/wg-3-linguistic-fieldwork-anthr
      Poio API: http://media.cidles.eu/poio/poio-api/
      GrAF: http://www.xces.org/ns/GrAF/1.0/
      CLASS: http://class.uni-koeln.de
