Querying GrAF data in linguistic analysis

The “Graph Annotation Framework” (GrAF) defines an API and an XML format to store and query linguistic annotations as annotation graphs. The format was standardized as ISO 24612 in 2012 and was explicitly developed as an underlying data model for linguistic annotations in a radical stand-off approach ([Ide and Suderman 2007]). The basic data structures are annotation graphs as proposed in [Bird and Liberman 2001]; they are general and expressive enough to encode all known varieties of annotation in linguistics and other “annotation-based” disciplines. Although GrAF is not a TEI-compatible format, the two standards share a technological foundation and grew in a similar ecosystem, with slightly different applications in mind. In our talk we will show the connections between TEI and GrAF, propose an option to convert between the “two worlds”, and demonstrate a query system for GrAF data that we already use in typological analysis of annotated data from language documentation projects.

Published in: Sports, Technology

Transcript

  • 1. Querying GrAF data in linguistic analysis Peter Bouda Centro Interdisciplinar de Documentação Linguística e Social pbouda@cidles.eu
  • 2. Overview ● Existing infrastructure and workflows ● GrAF ● GrAF and TEI ● Poio API ● Queries in Poio API ● Queries in GrAF API
  • 3. Fieldwork photos
  • 4. Existing Infrastructure
  • 5. LD tools and standards ● Elan: EAF, MPEG, WAV ● Toolbox: TXT, XML, WAV ● Arbil: IMDI/CMDI (“Component MetaData Infrastructure”) ● Praat: XML, WAV ● ... ● No standards for tier hierarchies, tier names or annotation schemes ● Efforts in ISOcat
  • 6. Interlinear Glossed Text
  • 7. GrAF ● GrAF: Graph Annotation Framework ● ISO 24612: Language resource management - Linguistic annotation framework (LAF) ● Started as stand-off version of XCES ● API and representation as data structures, not a file format ● GrAF/XML as XML representation ● Used for the MASC of the ANC ● Nodes, edges, regions, annotations, feature structures
  • 8. GrAF entities
  • 9. GrAF structure
  • 10. GrAF-XML
      <node xml:id="words..W-Words..na23">
        <link targets="words..W-Words..ra23"/>
      </node>
      <region anchors="780 1340" xml:id="words..W-Words..ra23"/>
      <edge from="utterance..W-Spch..n8" to="words..W-Words..na23" xml:id="ea23"/>
      <a as="words" label="words" ref="words..W-Words..na23" xml:id="a23">
        <fs>
          <f name="annotation_value">so</f>
        </fs>
      </a>
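The fragment on slide 10 can be read with nothing but the Python standard library, which makes the roles of the four element types easier to see. The following is a minimal sketch, not the GrAF API itself: the `<graph>` wrapper element and the variable names are assumptions added for illustration, and a real GrAF/XML file declares its own namespace and header.

```python
import xml.etree.ElementTree as ET

# The "xml" prefix is predefined, so xml:id is exposed under this qualified name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Fragment from the slide, wrapped in a hypothetical <graph> root so that it parses.
FRAGMENT = """<graph>
  <node xml:id="words..W-Words..na23">
    <link targets="words..W-Words..ra23"/>
  </node>
  <region anchors="780 1340" xml:id="words..W-Words..ra23"/>
  <edge from="utterance..W-Spch..n8" to="words..W-Words..na23" xml:id="ea23"/>
  <a as="words" label="words" ref="words..W-Words..na23" xml:id="a23">
    <fs>
      <f name="annotation_value">so</f>
    </fs>
  </a>
</graph>"""

root = ET.fromstring(FRAGMENT)

# Regions anchor a node in the primary data: region id -> tuple of anchors.
regions = {
    r.get(XML_ID): tuple(int(x) for x in r.get("anchors").split())
    for r in root.iter("region")
}

# Nodes point at regions through <link targets="...">.
node_regions = {
    n.get(XML_ID): [regions[t] for t in n.find("link").get("targets").split()]
    for n in root.iter("node")
}

# Edges connect a parent node to a child node.
edges = [(e.get("from"), e.get("to")) for e in root.iter("edge")]

# Annotations attach a label and a feature structure to a node.
annotations = {}
for a in root.iter("a"):
    features = {f.get("name"): f.text for f in a.iter("f")}
    annotations.setdefault(a.get("ref"), []).append((a.get("label"), features))

print(node_regions)
print(edges)
print(annotations)
```

The annotation never touches the primary data directly: it refers to a node, the node refers to a region, and only the region carries anchors into the recording or text.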
  • 11. TEI and GrAF ● Schemata for GrAF created with TEI Roma ● Customized version of TEI P5 schema ● ODD: “One Document Does it all” ● GrAF is not TEI compliant ● Share data types and feature structures of annotations ● TEI has “stand-off” variant, uses XPointer/XLink – Primary data has to be XML
  • 12. Why we use GrAF ● No inline markup ● Radical stand-off approach – Easier to share and manage data – Preferred solution to archive cultural heritage – Ideal for sparse annotations ● Existing code: Java and Python ● API vs. XQuery ● The beauty of annotation graphs
  • 13. Poio API ● Think of GrAF as an assembly language for linguistic annotation; then Poio API is a library to map from and to higher-level languages ● Subset of GrAF to represent tier-based annotation ● Filters and filter chains for search ● Plugin mechanism for file formats – Mapping semantics: tiers and annotations to nodes and edges ● Efforts to map between TEI and GrAF – Retro-digitized dictionary data at University of Marburg are published as GrAF files – We want to publish as TEI
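The mapping named on slide 13, tiers and annotations to nodes and edges, can be illustrated with a few lines of plain Python. This is only a sketch under stated assumptions: the `Graph` class, the `add_annotation` and `children` helpers, the simplified `tier..nN` ID scheme (modelled loosely on the IDs in the GrAF-XML slide) and the toy utterance are all hypothetical, and none of it is Poio API's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # node id -> annotation value
    edges: list = field(default_factory=list)  # (parent node id, child node id)


def add_annotation(graph, tier, index, value, parent_id=None):
    """Store one tier annotation as a node; link it to its parent by an edge."""
    node_id = "{0}..n{1}".format(tier, index)  # hypothetical ID scheme
    graph.nodes[node_id] = value
    if parent_id is not None:
        graph.edges.append((parent_id, node_id))
    return node_id


def children(graph, parent_id):
    """Return the child node ids of a node, in insertion order."""
    return [child for parent, child in graph.edges if parent == parent_id]


# Toy data: one utterance tier with a dependent word tier.
graph = Graph()
utterance = add_annotation(graph, "utterance..W-Spch", 8, "so you follow")
for i, word in enumerate(["so", "you", "follow"], start=23):
    add_annotation(graph, "words..W-Words", i, word, parent_id=utterance)

print(children(graph, utterance))
```

In this reading the tier hierarchy is not stored separately: it is implied by which tier names appear at either end of an edge.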
  • 14. Queries in GrAF API ● All queries are in-memory ● Users can load parts of the full graph ● Annotation graph to network conversion – Python library networkx ● Example: Semantic similarity
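Slide 14 does not say how semantic similarity was computed, so the following is only one plausible reading: once heads and translations from dictionary entries are turned into a plain network, two heads can be compared by the translations they share. The word pairs are invented toy data, the Jaccard measure is an assumption, and the sketch uses only the standard library; the same pairs could equally be added as edges to a networkx graph.

```python
from collections import defaultdict

# Hypothetical (head, translation) pairs, as they might be collected from
# the out-edges of dictionary entry nodes in an annotation graph.
pairs = [
    ("casa", "house"),
    ("lar", "home"),
    ("lar", "house"),
    ("rio", "river"),
]

# Undirected bipartite network: heads on one side, translations on the other.
neighbors = defaultdict(set)
for head, translation in pairs:
    neighbors[("head", head)].add(("translation", translation))
    neighbors[("translation", translation)].add(("head", head))


def jaccard(a, b):
    """Share of common neighbours among all neighbours of two network nodes."""
    na = neighbors.get(a, set())
    nb = neighbors.get(b, set())
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


print(jaccard(("head", "casa"), ("head", "lar")))
print(jaccard(("head", "casa"), ("head", "rio")))
```

Because everything is held in memory, loading only the relevant part of the graph (here, entry heads and translations) keeps such a network small.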
  • 15. Queries in GrAF API
      for (node_id, node) in graf_graph.nodes.items():
          if node_id.endswith("entry"):
              for e in node.out_edges:
                  if e.annotations.get_first().label == "head" or e.annotations.get_first().label == "translation":
                      features = e.to_node.annotations.get_first().features
                      substr = features.get_value("substring")
                      [...]
  • 16. Queries in Poio API ● Example: Word order in Hinuq
  • 17. Queries in Poio API
      import collections

      ag = from_excel("data/Hinuq2.csv")
      clause_unit_nodes = ag.nodes_for_tier("clause_id")
      verbs = ['COP', 'cop', 'SAY', 'say', 'v.tr', 'v.intr', 'v.aff']
      others = ['A', 'S', 'P', 'EXP', 'STIM']
      search_terms = verbs + others
      # Count how often each sequence of constituents occurs.
      word_orders = collections.defaultdict(int)
      for parent_node in clause_unit_nodes:
          word_order = []
          for word_n in parent_node.iter_children():
              a_list = ag.annotations_for_tier("grammatical_relation", word_n)
              if len(a_list) > 0:
                  a_value = ag.annotation_value_for_annotation(a_list[0])
                  if a_value in search_terms:
                      # All verb tags collapse into a single 'V'.
                      if a_value in verbs:
                          word_order.append('V')
                      else:
                          word_order.append(a_value)
          word_orders[tuple(word_order)] += 1
  • 18. Filters and filter chains
      import poioapi.annotationgraph
      import poioapi.data

      ag = poioapi.annotationgraph.AnnotationGraph()
      ag.from_elan("elan-example3.eaf")
      ag.structure_type_handler = poioapi.data.DataStructureType(ag.tier_hierarchies[0])

      # Build a filter tier by tier and append it to the filter chain.
      af = poioapi.annotationgraph.AnnotationGraphFilter(ag)
      af.set_filter_for_tier("words..W-Words", "follow")
      af.set_filter_for_tier("part_of_speech..W-POS", r"\bpro\b")
      ag.append_filter(af)
      print("Filtered root nodes:")
      print(ag.filtered_node_ids)

      # Alternatively, create the filter from a dictionary of search terms.
      search_terms = {
          "words..W-Words": "follow",
          "part_of_speech..W-POS": r"\bpro\b"
      }
      af = ag.create_filter_for_dict(search_terms)
      ag.append_filter(af)
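To make the idea of a filter chain on slide 18 concrete without a Poio API installation, here is a small stand-in written with the standard library only. It assumes that a filter holds one regular expression per tier, that a root node passes when every tier has at least one matching annotation, and that each filter in a chain narrows the previous result. The helper names `make_filter` and `apply_chain` and the toy data are hypothetical; Poio API's own matching logic may differ.

```python
import re

# Toy stand-in for root nodes: node id -> {tier name: annotation values}.
nodes = {
    "utterance..n1": {
        "words..W-Words": ["so", "you", "follow"],
        "part_of_speech..W-POS": ["adv", "pro", "v"],
    },
    "utterance..n2": {
        "words..W-Words": ["they", "followed"],
        "part_of_speech..W-POS": ["pro", "v"],
    },
    "utterance..n3": {
        "words..W-Words": ["follow", "Ana"],
        "part_of_speech..W-POS": ["v", "prop"],
    },
}


def make_filter(search_terms):
    """Return a predicate that requires a regex match on every listed tier."""
    compiled = {tier: re.compile(p) for tier, p in search_terms.items()}

    def keep(tiers):
        return all(
            any(rx.search(value) for value in tiers.get(tier, []))
            for tier, rx in compiled.items()
        )

    return keep


def apply_chain(nodes, filters):
    """Return the ids of nodes that pass every filter in the chain."""
    return [nid for nid, tiers in nodes.items() if all(f(tiers) for f in filters)]


chain = [make_filter({"words..W-Words": "follow"})]
print(apply_chain(nodes, chain))

# Appending a second filter narrows the result; \b keeps "prop" from matching.
chain.append(make_filter({"part_of_speech..W-POS": r"\bpro\b"}))
print(apply_chain(nodes, chain))
```

Keeping filters as separate objects in a list means a query can be stored, replayed or extended step by step, which fits the requirement from the Poio Analyzer slide to publish and archive queries together with their results.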
  • 19. Poio Analyzer ● Developed for and with Prof. Johannes Helmbrecht, University of Regensburg ● How to query the corpus in order to write a descriptive grammar? ● Started with a list of requirements ● Need to publish and archive queries and results
  • 20. Poio Analyzer
  • 21. Thank you for your attention! pbouda@cidles.eu
  • 22. Links Clarin curation project: http://de.clarin.eu/en/discipline-specific-working-groups/wg-3-linguistic-fieldwork-anthr Poio: http://media.cidles.eu/poio/ GrAF: http://www.xces.org/ns/GrAF/1.0/