Querying GrAF data in linguistic analysis


The “Graph Annotation Framework” (GrAF) defines an API and an XML format for storing and querying linguistic annotations as annotation graphs. The format was standardized as ISO 24612 in 2012 and was explicitly developed as an underlying data model for linguistic annotations in a radical stand-off approach ([Ide and Suderman 2007]). Its basic data structures are annotation graphs as proposed in [Bird and Liberman 2001]; they are general and expressive enough to encode all known varieties of annotation in linguistics and other “annotation-based” disciplines. Although GrAF is not a TEI-compatible format, the two standards share a technological foundation and grew up in a similar ecosystem, with slightly different applications in mind. In our talk we will show the connections between TEI and GrAF, propose an option for converting between the “two worlds”, and demonstrate a query system for GrAF data that we already use in the typological analysis of annotated data from language documentation projects.


  1. Querying GrAF data in linguistic analysis
     Peter Bouda
     Centro Interdisciplinar de Documentação Linguística e Social
     pbouda@cidles.eu
  2. Overview
     ● Existing infrastructure and workflows
     ● GrAF
     ● GrAF and TEI
     ● Poio API
     ● Queries in Poio API
     ● Queries in GrAF API
  3. Fieldwork photos
  4. Existing infrastructure
  5. LD tools and standards
     ● ELAN: EAF, MPEG, WAV
     ● Toolbox: TXT, XML, WAV
     ● Arbil: IMDI/CMDI (“Component MetaData Infrastructure”)
     ● Praat: XML, WAV
     ● ...
     ● No standards for tier hierarchies, tier names or annotation schemes
     ● Efforts in ISOcat
  6. Interlinear Glossed Text
  7. GrAF
     ● GrAF: Graph Annotation Framework
     ● ISO 24612: Language resource management - Linguistic annotation framework (LAF)
     ● Started as stand-off version of XCES
     ● API and representation as data structures, not a file format
     ● GrAF/XML as XML representation
     ● Used for the Manually Annotated Sub-Corpus (MASC) of the American National Corpus (ANC)
     ● Nodes, edges, regions, annotations, feature structures
  8. GrAF entities
  9. GrAF structure
  10. GrAF-XML

          <node xml:id="words..W-Words..na23">
            <link targets="words..W-Words..ra23"/>
          </node>
          <region anchors="780 1340" xml:id="words..W-Words..ra23"/>
          <edge from="utterance..W-Spch..n8" to="words..W-Words..na23" xml:id="ea23"/>
          <a as="words" label="words" ref="words..W-Words..na23" xml:id="a23">
            <fs>
              <f name="annotation_value">so</f>
            </fs>
          </a>
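To make the roles of the four element types concrete, here is a minimal sketch that reads the fragment above with Python's standard library and links each node to its regions and annotations. The wrapping `<graph>` root and the absence of a namespace declaration are simplifications made for this example, and `summarize` is a hypothetical helper, not part of any GrAF library.

```python
import xml.etree.ElementTree as ET

# The slide's fragment, wrapped in an assumed <graph> root so it parses
# as a standalone document (namespace declaration omitted for brevity).
FRAGMENT = """<graph>
  <node xml:id="words..W-Words..na23">
    <link targets="words..W-Words..ra23"/>
  </node>
  <region anchors="780 1340" xml:id="words..W-Words..ra23"/>
  <edge from="utterance..W-Spch..n8" to="words..W-Words..na23" xml:id="ea23"/>
  <a as="words" label="words" ref="words..W-Words..na23" xml:id="a23">
    <fs>
      <f name="annotation_value">so</f>
    </fs>
  </a>
</graph>"""

# ElementTree expands the predefined xml: prefix to this qualified name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def summarize(xml_text):
    """Return (nodes, edges); hypothetical helper for illustration only."""
    root = ET.fromstring(xml_text)

    # Regions anchor a node in the primary data (here: two integer anchors).
    regions = {
        r.get(XML_ID): tuple(int(x) for x in r.get("anchors").split())
        for r in root.iter("region")
    }

    # Edges connect nodes, e.g. an utterance node to a word node.
    edges = [(e.get("from"), e.get("to")) for e in root.iter("edge")]

    # Annotations point at a node via "ref" and carry a feature structure.
    annotations = {}
    for a in root.iter("a"):
        features = {f.get("name"): f.text for f in a.iter("f")}
        annotations.setdefault(a.get("ref"), []).append((a.get("label"), features))

    nodes = {}
    for n in root.iter("node"):
        node_id = n.get(XML_ID)
        targets = [t for link in n.iter("link") for t in link.get("targets").split()]
        nodes[node_id] = {
            "regions": [regions[t] for t in targets],
            "annotations": annotations.get(node_id, []),
        }
    return nodes, edges


nodes, edges = summarize(FRAGMENT)
print(nodes)
print(edges)
```

The point of the sketch is that no element contains the primary data itself: the node only links to a region, and the annotation only refers to the node.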
  11. TEI and GrAF
      ● Schemata for GrAF created with TEI Roma
      ● Customized version of TEI P5 schema
      ● ODD: “One Document Does it all”
      ● GrAF is not TEI compliant
      ● Share data types and feature structures of annotations
      ● TEI has “stand-off” variant, uses XPointer/XLink
        – Primary data has to be XML
  12. Why we use GrAF
      ● No inline markup
      ● Radical stand-off approach
        – Easier to share and manage data
        – Preferred solution to archive cultural heritage
        – Ideal for sparse annotations
      ● Existing code: Java and Python
      ● API vs. XQuery
      ● The beauty of annotation graphs
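The stand-off idea from the slide above can be illustrated with a small sketch: the primary data stays untouched, and annotations live in a separate structure that only stores anchors into it. The sentence, the tier names and the `StandoffAnnotation` class are all invented for this example; anchors are character offsets here, but they could equally be time offsets into a recording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandoffAnnotation:
    """Hypothetical stand-off annotation: anchors plus a value, no markup."""
    start: int   # offset of the first character in the primary data
    end: int     # offset just past the last character
    tier: str
    value: str


# Primary data is never modified; it contains no inline markup.
PRIMARY_TEXT = "so we follow the river"

# Sparse annotations: only two of the five words carry a part of speech.
ANNOTATIONS = [
    StandoffAnnotation(0, 22, "utterance", "u1"),
    StandoffAnnotation(6, 12, "part_of_speech", "v"),
    StandoffAnnotation(17, 22, "part_of_speech", "n"),
]


def covered_text(primary, annotation):
    """Resolve an annotation's anchors against the primary data."""
    return primary[annotation.start:annotation.end]


def annotations_at(annotations, offset):
    """All annotations whose anchored span contains the given offset."""
    return [a for a in annotations if a.start <= offset < a.end]


for annotation in ANNOTATIONS:
    print(annotation.tier, annotation.value, repr(covered_text(PRIMARY_TEXT, annotation)))
```

Because unannotated stretches simply have no entries, sparse annotation costs nothing, and the annotation list can be shared or archived separately from the primary data.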
  13. Poio API
      ● Think of GrAF as an assembly language for linguistic annotation; then Poio API is a library to map from and to higher-level languages
      ● Subset of GrAF to represent tier-based annotation
      ● Filters and filter chains for search
      ● Plugin mechanism for file formats
        – Mapping semantics: tiers and annotations to nodes and edges
      ● Efforts to map between TEI and GrAF
        – Retro-digitized dictionary data at the University of Marburg are published as GrAF files
        – We want to publish as TEI
  14. Queries in GrAF API
      ● All queries are in-memory
      ● Users can load parts of the full graph
      ● Annotation graph to network conversion
        – Python library networkx
      ● Example: Semantic similarity
  15. Queries in GrAF API

          # Visit every dictionary entry node and follow its outgoing edges
          for (node_id, node) in graf_graph.nodes.items():
              if node_id.endswith("entry"):
                  for e in node.out_edges:
                      if e.annotations.get_first().label == "head" or e.annotations.get_first().label == "translation":
                          features = e.to_node.annotations.get_first().features
                          substr = features.get_value("substring")
                          [...]
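The slides do not show what happens after the heads and translations have been collected, so the following is only a sketch of one possible continuation. It assumes the loop yields (head, translation) pairs, builds a bipartite network from them, and scores two heads as similar when they share translations (Jaccard overlap). The pairs are invented toy data, the similarity measure is an assumption, and plain dictionaries stand in for the networkx graph mentioned in the talk so that the example has no third-party dependencies.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (head, translation) pairs as the entry loop might collect them.
PAIRS = [
    ("casa", "house"), ("casa", "home"),
    ("lar", "home"), ("lar", "hearth"),
    ("rio", "river"),
]


def build_network(pairs):
    """Undirected bipartite network stored as an adjacency mapping."""
    adjacency = defaultdict(set)
    for head, translation in pairs:
        adjacency[("head", head)].add(("translation", translation))
        adjacency[("translation", translation)].add(("head", head))
    return adjacency


def jaccard_similarities(adjacency):
    """Score each pair of heads by the overlap of their translation sets."""
    heads = sorted(n for n in adjacency if n[0] == "head")
    scores = {}
    for a, b in combinations(heads, 2):
        shared = adjacency[a] & adjacency[b]
        union = adjacency[a] | adjacency[b]
        scores[(a[1], b[1])] = len(shared) / len(union)
    return scores


network = build_network(PAIRS)
for pair, score in jaccard_similarities(network).items():
    print(pair, round(score, 2))
```

With a network library the same adjacency could be loaded as a graph and queried with its neighbourhood or path functions; the dictionary version only shows the shape of the conversion.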
  16. Queries in Poio API
      ● Example: Word order in Hinuq
  17. Queries in Poio API

          import collections

          # from_excel is used as on the slide; its module is not shown there
          ag = from_excel("data/Hinuq2.csv")
          clause_unit_nodes = ag.nodes_for_tier("clause_id")

          verbs = ['COP', 'cop', 'SAY', 'say', 'v.tr', 'v.intr', 'v.aff']
          others = ['A', 'S', 'P', 'EXP', 'STIM']
          search_terms = verbs + others

          # Count how often each order of verb and arguments occurs per clause
          word_orders = collections.defaultdict(int)
          for parent_node in clause_unit_nodes:
              word_order = []
              for word_n in parent_node.iter_children():
                  a_list = ag.annotations_for_tier("grammatical_relation", word_n)
                  if len(a_list) > 0:
                      a_value = ag.annotation_value_for_annotation(a_list[0])
                      if a_value in search_terms:
                          # All verb labels collapse to a single 'V'
                          if a_value in verbs:
                              word_order.append('V')
                          else:
                              word_order.append(a_value)
              word_orders[tuple(word_order)] += 1
  18. Filters and filter chains

          import poioapi.annotationgraph
          import poioapi.data

          ag = poioapi.annotationgraph.AnnotationGraph()
          ag.from_elan("elan-example3.eaf")
          ag.structure_type_handler = poioapi.data.DataStructureType(ag.tier_hierarchies[0])

          # Variant 1: build the filter tier by tier
          af = poioapi.annotationgraph.AnnotationGraphFilter(ag)
          af.set_filter_for_tier("words..W-Words", "follow")
          af.set_filter_for_tier("part_of_speech..W-POS", r"\bpro\b")
          ag.append_filter(af)

          print("Filtered root nodes:")
          print(ag.filtered_node_ids)

          # Variant 2: build the same filter from a dictionary of search terms
          search_terms = {
              "words..W-Words": "follow",
              "part_of_speech..W-POS": r"\bpro\b"
          }
          af = ag.create_filter_for_dict(search_terms)
          ag.append_filter(af)
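How a filter chain narrows down the root nodes can be sketched without Poio API. The version below assumes that each tier pattern is a regular expression searched in the tier's annotation values, that all tier patterns of one filter must match, and that each filter in a chain operates on the result of the previous one. These semantics, the toy nodes and the helper names are assumptions for illustration; Poio API's actual behaviour may differ.

```python
import re

# Hypothetical root nodes: tier name -> annotation values below that node.
ROOT_NODES = {
    "n1": {"words..W-Words": ["they", "follow"], "part_of_speech..W-POS": ["pro", "v"]},
    "n2": {"words..W-Words": ["follow", "Anna"], "part_of_speech..W-POS": ["v", "propn"]},
    "n3": {"words..W-Words": ["she", "left"], "part_of_speech..W-POS": ["pro", "v"]},
}


def make_filter(search_terms):
    """Build a predicate from a {tier: regex} dictionary (assumed AND logic)."""
    compiled = {tier: re.compile(pattern) for tier, pattern in search_terms.items()}

    def matches(tiers):
        # Every tier pattern must be found in at least one value of its tier.
        return all(
            any(regex.search(value) for value in tiers.get(tier, []))
            for tier, regex in compiled.items()
        )

    return matches


def apply_chain(nodes, filters):
    """Apply filters one after another, each on the previous result."""
    remaining = list(nodes)
    for node_filter in filters:
        remaining = [node_id for node_id in remaining if node_filter(nodes[node_id])]
    return remaining


chain = [
    make_filter({"words..W-Words": "follow"}),
    make_filter({"part_of_speech..W-POS": r"\bpro\b"}),
]
print(apply_chain(ROOT_NODES, chain))
```

The toy data also shows why word boundaries matter in a part-of-speech pattern: a bare `pro` would match the value `propn` in `n2`, whereas `\bpro\b` matches only the complete tag.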
  19. Poio Analyzer
      ● Developed for and with Prof. Johannes Helmbrecht, University of Regensburg
      ● How to query the corpus in order to write a descriptive grammar?
      ● Started with a list of requirements
      ● Need to publish and archive queries and results
  20. Poio Analyzer
  21. Thank you for your attention!
      pbouda@cidles.eu
  22. Links
      Clarin curation project: http://de.clarin.eu/en/discipline-specific-working-groups/wg-3-linguistic-fieldwork-anthr
      Poio: http://media.cidles.eu/poio/
      GrAF: http://www.xces.org/ns/GrAF/1.0/