Poio API and GraF-XML @ Balisage 2013

  • 262 views
Uploaded on

Language documentation projects all over the world have accumulated a large and heterogeneous corpus of linguistic material. Because of its diversity, access to and analysis of the components is …

Language documentation projects all over the world have accumulated a large and heterogeneous corpus of linguistic material. Because of its diversity, access to and analysis of the components is difficult, particularly for multimedia instances. The "Graph Annotation Framework" (GrAF), a standoff annotation method, is applied to utterance examples in time-aligned annotations of video samples. An easy-to-use programming interface defined in the Poio API, a project within the CLARIN framwork ("Common Language Resources and Technology Infrastructure"), then greatly simplifies access without the need to deal with multiple input formats in the source material. GrAF-XML provides a basis for exchanging results among the various projects that analyze the corpus.

More in: Technology , Education
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
    Be the first to like this
No Downloads

Views

Total Views
262
On Slideshare
0
From Embeds
0
Number of Embeds
0

Actions

Shares
Downloads
1
Comments
0
Likes
0

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Poio API and GrAF-XML A radical stand-off approach in language documentation and language typology Jonathan Blumtritt, Cologne Center for eHumanities, University of Cologne Peter Bouda, Centro Interdisciplinar de Documentação Linguística e Social Felix Rau, Department of Linguistics, University of Cologne
  • 2. Overview ● Existing infrastructure and workflows ● CLARIN ● Annotation graphs ● GrAF and Poio API ● Example: Elan EAF to GrAF-XML ● CLASS
  • 3. Fieldwork Fotos
  • 4. Existing Infrastructure
  • 5. LD tools and standards ● Elan: EAF, MPEG, WAV ● Toolbox: TXT, XML, WAV ● Arbil: IMDI/CIMDI („Component MetaData Infrastructure“) ● Praat: XML, WAV ● ... ● No standards for tier hierarchies, tier names or annotation schemes ● Efforts in ISOcat
  • 6. ● European initiative within the European Research Infrastructure Consortium: Common Language Resources and Technology Infrastructure (CLARIN) ● aims at providing easy and sustainable access for scholars in the humanities and social sciences to digital language data ● Started in 2006, part of a roadmap process, timeline currently ending 2020 ● CLARIN-D: working groups in Germany ● Curation projects for different research areas in linguistics
  • 7. Annotation Graphs ● the underlying data model for linguistic annotations ● pivot structure for linguistic data ● time vs. byte offsets ● not hierarchical (but trees are also graphs) ● stand-off annotation ● "It is important to recognize that translation into AGs does not magically create compatibility among systems whose semantics are different." [Bird & Liberman 2001]
  • 8. AGs visualized
  • 9. GrAF ● GrAF: Graph Annotation Framework ● ISO 24612: Language resource management - Linguistic annotation framework (LAF) ● Started as stand-off version of XCES ● API and representation as data structures, not a file format ● GrAF/XML as XML representation ● Used for the MASC of the ANC ● Nodes, edges, regions, annotations, feature structures
  • 10. TEI and GrAF ● Schemata for GrAF created with TEI Roma ● Custumized version of TEI P5 schema ● ODD: „One Document Does it all“ ● GrAF is not TEI compliant ● Share data types and feature structures of annotations ● TEI has „stand-off“ variant, uses XPointer/XLink – Primary data has to be XML
  • 11. Why we use GrAF ● Because it's new! :-) ● No inline markup ● Radical stand-off approach – Easier to share and manage data – Preferred solution to archive cultural heritage – Ideal for sparse annotations ● Existing code: Java and Python ● The beauty of annotation graphs
  • 12. Poio API ● Think of GrAF as an assembly language for linguistic annotation; then Poio API is a libray to map from and to higher-level languages ● Subset of GrAF to represent tier based annotation ● Filters and filter chains for search ● Plugin mechanism for file formats – Mapping semantics: tiers and annotations to nodes and edges ● Meta-data for additional information (tier types etc.)
  • 13. Example: Mapping of EAF to GrAF-XML
  • 14. Elan EAF <TIER DEFAULT_LOCALE="en" LINGUISTIC_TYPE_REF="words" PARENT_REF="W-Spch" PARTICIPANT="" TIER_ID="W-Words"> <ANNOTATION> <ALIGNABLE_ANNOTATION ANNOTATION_ID="a23" TIME_SLOT_REF1="ts4" TIME_SLOT_REF2="ts6"> <ANNOTATION_VALUE>so</ANNOTATION_VALUE> </ALIGNABLE_ANNOTATION> </ANNOTATION> <ANNOTATION> [...] </ANNOTATION> </TIER>
  • 15. GrAF entities
  • 16. GrAF structure
  • 17. GrAF-XML <node xml:id="words..W-Words..na23"> <link targets="words..W-Words..ra23"/> </node> <region anchors="780 1340" xml:id="words..W-Words..ra23"/> <edge from="utterance..W-Spch..n8" to="words..W-Words..na23" xml:id="ea23"/> <a as="words" label="words" ref="words..W-Words..na23" xml:id="a23"> <fs> <f name="annotation_value">so</f> </fs> </a>
  • 18. Tier hierarchies [ ['utterance..K-Spch'], ['utterance..W-Spch', ['words..W-Words', ['part_of_speech..W-POS'] ], ['phonetic_transcription..W-IPA'] ], ['gestures..W-RGU', ['gesture_phases..W-RGph', ['gesture_meaning..W-RGMe'] ] ], ['gestures..K-RGU', ['gesture_phases..K-RGph', ['gesture_meaning..K-RGMe'] ] ] ]
  • 19. The code ag = poioapi.annotationgraph.AnnotationGraph() parser = poioapi.io.ElanParser("example.eaf") writer = poioapi.io.graf.Writer() converter = poioapi.io.graf.GrAFConverter(parser, writer) converter.parse() converter.write("example.hdr")
  • 20. Analysis workflows ● Graph-based methods ● Pipe to scientific Python libraries ● GrAF connectors for major linguistic workflow tools (GATE and Apache UIMA) ● Example: Polysemy in dictionaries ● Example: Counting word orders
  • 21. CLASS
  • 22. Thank you for your attention! pbouda@cidles.eu
  • 23. Links Clarin curation project: http://de.clarin.eu/en/discipline-specific-working-groups/wg-3-linguistic-fieldwork-anthr Poio API: http://media.cidles.eu/poio/poio-api/ GrAF: http://www.xces.org/ns/GrAF/1.0/ CLASS: http://class.uni-koeln.de