The document discusses representing heterogeneous dialect data from the DBÖ collection in a standardized way. It considers using XML-based standards like TEI to encode the data onomasiologically rather than just semasiologically. Two options - modifying TEI's <entryFree> element or using a TBX-TEI hybrid - are presented but have problems in fully and legitimately representing the data. The conclusion is that TEI needs a dedicated means of encoding onomasiological entries with an appropriate set of elements and attributes to fully capture the lexical information.
Exploring Onomasiological Data Models for Heterogeneous Dialect Lexical Data
1. Jack T. Bowers
Melanie Seltmann
Austrian Academy of Sciences - Austrian Center for Digital Humanities
Exploring data models for heterogeneous
dialect data:
the case of explore.bread.AT!
2. Outline of Presentation
Part I: Overview of project & data
Part II: Overview of possible solutions using XML-based
markup standards for representing onomasiological
dialectal language
3. explore.AT!
Overview:
• DBÖ: collection of Bavarian dialectal speech, begun in 1911
• Converted from TUSTEP to TEI in 2015-2016
Goals
• Gain cultural and linguistic insights into Bavarian dialects in the former
Austro-Hungarian Empire;
• Update and improve the existing body of resources by converting them to
conform with standards and best practice (ISOcat, ISOconcept, etc.);
• Enhance usability and compatibility of data in order to share with
project partners;
• Integration of semantic web/LOD resources;
4. Project Overview: Datasets
DBÖ@TEI
WBÖ@TEI
BaseX Database
place inventory (TEI-listPlace)
concept inventory(TEI-feature structures)
gram features inventory (TEI-feature structures)
questionnaires (TEI-list)
DBÖ@ema
SQL
BaseX Database
Extracted Topical Datasets
explore.bread
The language of Color
lexicon(location(a))
inventory(lexicalFeature(a))
• Domain/Topic-based (exploreBread)
• Location
• Lexical/grammatical features
Possible basis for examination of sub-datasets
6. DBÖ Questionnaires
Questionnaires:
While the questionnaires are topical in general, they are a complicated
mixture of semasiological (term-based) and onomasiological
(concept-based) questions,
e.g.
(31B5) bes. Weißgebäcke:
länglich flaches, gerundetes Weißgebäck, z.B. Strutz (l.!),
Strutzen, Strützel, Wecken u.a.; scherzhafte Bez. wie Schendarm
The means of extracting this information were initially limited to:
• Questionnaires
• String searches in certain data fields
The dataset requires significant manual editing and curation due to the
nature of the questionnaires
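The two extraction routes listed above can be sketched roughly as follows. This is a minimal illustration, assuming hypothetical flat records with `questionnaire`, `headword` and `meaning` fields; the field names, the questionnaire number `12A1` and the sample values are invented for the example and are not the DBÖ schema.

```python
# Hypothetical, simplified records; field names and values are assumptions.
records = [
    {"questionnaire": "31B5", "headword": "Wecken", "meaning": "längliches Weißgebäck"},
    {"questionnaire": "31B5", "headword": "Strutzen", "meaning": "flaches Weißgebäck"},
    {"questionnaire": "12A1", "headword": "Brotzeit", "meaning": "Zwischenmahlzeit"},
]

def by_questionnaire(records, number):
    """Select records collected in answer to one questionnaire item."""
    return [r for r in records if r["questionnaire"] == number]

def by_string(records, needle, fields=("headword", "meaning")):
    """Select records where any of the given fields contains the search string."""
    needle = needle.lower()
    return [r for r in records if any(needle in r[f].lower() for f in fields)]

bread_by_item = by_questionnaire(records, "31B5")
bread_by_text = by_string(records, "brot")
```

In this toy data the string search for "brot" misses both bread terms and returns only the unrelated "Brotzeit" record, which hints at why such extracts need manual curation.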
7. Desired Enhancements
In most sub-topical studies such as ExploreBread!, it would be
beneficial to be able to format the data onomasiologically,
for example:
• Domain and/or concept-oriented entries better represent the content of
interest
• Information retrieval
• Ontology mapping
• Etymological and/or morphosyntactic analysis
• Cross-linguistic (or cross-dialectal) comparison or translation
Problem:
> TEI has no explicitly designated means of
encoding onomasiological data!
9. Lexical Organization
Semasiological: the starting point is a word form, and the model identifies
the meanings and senses associated with it
Semasiological Lexical Model
Form
  meaning(i)  meaning(ii)  meaning(iii)
Onomasiological: the starting point is a concept, and the model looks at the
forms used to represent it
Onomasiological Lexical Model
Concept
  Form(i)  Form(ii)  Form(iii)
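The relation between the two models can be illustrated with a small Python sketch that inverts a form-to-meanings mapping into a concept-to-forms mapping. The data is a toy example: the two forms come from the questionnaire item quoted earlier, while the second sense is a labelled placeholder, not an attested meaning.

```python
from collections import defaultdict

# Semasiological view: each form points to its meanings (toy data).
semasiological = {
    "Wecken": ["oblong loaf of bread", "second sense (placeholder)"],
    "Strutzen": ["oblong loaf of bread"],
}

def to_onomasiological(form_to_meanings):
    """Invert form -> meanings into concept -> forms."""
    concept_to_forms = defaultdict(list)
    for form, meanings in form_to_meanings.items():
        for meaning in meanings:
            concept_to_forms[meaning].append(form)
    return dict(concept_to_forms)

onomasiological = to_onomasiological(semasiological)
```

The inversion treats each meaning string as a concept identifier, which is a simplification; in practice the concepts would presumably come from a curated concept inventory rather than from free-text definitions.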
10. Headword: BROT
Lemma(i..n): brot, broet, brɛot
Prôt Prôt Prôt
Core DBÖ entry datatypes
—————————————-
Archive record
Headword (Form)
POS
Dialect lemma (Form)
Gram info
Meaning (Sense)
Usage example
Source
Place
Questionnaire
Etymology
Desired Data Structure
Desired Onomasiological Model for Extracted
Terminological DBÖ Datasets
TermEntry
Concept(a)
DialectEntry(i) DialectEntry(ii) DialectEntry(n)
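As a rough sketch of the desired structure, the following Python code groups flat rows into one term entry per concept, each holding its dialect entries. The class and field names (`TermEntry`, `DialectEntry`, `headword`, `lemma`, `place`) are assumptions chosen for illustration; the sample values are taken from the Wecken/Strutzen example used in this presentation.

```python
from dataclasses import dataclass, field

@dataclass
class DialectEntry:
    headword: str  # DBÖ headword (hauptlemma)
    lemma: str     # dialect form as recorded
    place: str     # place where the form was collected

@dataclass
class TermEntry:
    concept: str
    entries: list = field(default_factory=list)

def build_term_entries(rows):
    """Group flat (concept, headword, lemma, place) rows into one TermEntry per concept."""
    by_concept = {}
    for concept, headword, lemma, place in rows:
        term = by_concept.setdefault(concept, TermEntry(concept))
        term.entries.append(DialectEntry(headword, lemma, place))
    return list(by_concept.values())

rows = [
    ("concept:Wecken", "Wecken", "W.eiggn", "St.Michael/B. Bgl."),
    ("concept:Wecken", "Strutzen", "Struzn", "Rohrb. OÖ"),
]
term_entries = build_term_entries(rows)
```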
11. Options using XML-Based Standards
(i) TEI Hacks: Alternate TEI Dictionary format (<entryFree>)
(ii) TEI-TBX Hybrid (Romary, 2014)
OR… use TEI P4
12. TEI <entryFree> Model
[Figure: tree diagram of the proposed model. <entryFree @xml:id> holds the
concept as a <sense @corresp/> containing <usg @type="dom"> (concept: domain)
and <def @xml:lang> (concept: meaning). Each headword is a <superEntry>
containing a <form type="hauptlemma"> with <orth> (Form: headword), followed
by the DBÖ entries for that headword as <entry @xml:id @xml:lang="bar">
(Form: dialect; Metadata: DBÖ entry), with all other element content copied
from the original without alteration. Cardinality labels shown on the slide:
(1…n), (0…n), (1…1).]
13. TEI <entryFree> Model
<entryFree>
<sense corresp="concept:Wecken">
<usg type="dom" corresp="concept:Brot">Brot</usg>
<def xml:lang="en" resp="#JB">Oblong loaf of bread</def>
</sense>
<superEntry> <!-- for each unique hauptlemma for concept entry -->
<form type="hauptlemma">
<orth>Wecken</orth>
</form>
<entry xml:id="w834_qdb-d1e602b" xml:lang="bar">
<!-- hauptlemma removed from here; entry content abbreviated -->
<form type="lautung" n="1">
<pron notation="tustep">W.eiggn</pron>
<pron notation="ipa" resp="#JB" change="01.2">ʋɛiggn̩</pron>
</form>
<usg type="geo">
<placeName>St.Michael/B. Bgl.</placeName>
</usg>
</entry>
<!-- all entries with headword "Wecken" (ii..n) --> </superEntry>
<superEntry>
<form type="hauptlemma">
<orth>Strutzen</orth>
</form>
<entry xml:id="s806_qdb-d1e43847b" xml:lang="bar">
<!-- hauptlemma removed from here; entry content abbreviated -->
<form type="lautung" n="1">
<pron notation="tustep">Struzn</pron>
<pron notation="ipa" resp="#JB" change="01.2">ʃtruzn̩</pron>
</form>
<usg type="geo">
<placeName>Rohrb. OÖ</placeName>
</usg>
</entry>
<!-- all entries with headword "Strutzen" (ii..n) --> </superEntry>
</entryFree>
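A structure of this shape could be generated programmatically. The sketch below uses Python's standard `xml.etree.ElementTree` to build an `<entryFree>` with one `<superEntry>` per headword. The function name `build_entry_free` and its parameters are hypothetical, only a subset of the elements from the example is produced, and the TEI namespace is omitted, as it is in the example.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"  # namespace behind xml:id / xml:lang

def build_entry_free(concept, domain, definition, headwords):
    """Build an <entryFree> with one <superEntry> per headword.

    headwords maps a headword to a list of (entry_id, tustep_pron, place) tuples.
    """
    entry_free = ET.Element("entryFree")
    sense = ET.SubElement(entry_free, "sense", corresp=f"concept:{concept}")
    usg = ET.SubElement(sense, "usg", type="dom", corresp=f"concept:{domain}")
    usg.text = domain
    definition_el = ET.SubElement(sense, "def")
    definition_el.set(f"{{{XML_NS}}}lang", "en")
    definition_el.text = definition
    for headword, entries in headwords.items():
        super_entry = ET.SubElement(entry_free, "superEntry")
        form = ET.SubElement(super_entry, "form", type="hauptlemma")
        ET.SubElement(form, "orth").text = headword
        for entry_id, pron, place in entries:
            entry = ET.SubElement(super_entry, "entry")
            entry.set(f"{{{XML_NS}}}id", entry_id)
            entry.set(f"{{{XML_NS}}}lang", "bar")
            lautung = ET.SubElement(entry, "form", type="lautung", n="1")
            ET.SubElement(lautung, "pron", notation="tustep").text = pron
            geo = ET.SubElement(entry, "usg", type="geo")
            ET.SubElement(geo, "placeName").text = place
    return entry_free

tree = build_entry_free(
    "Wecken", "Brot", "Oblong loaf of bread",
    {
        "Wecken": [("w834_qdb-d1e602b", "W.eiggn", "St.Michael/B. Bgl.")],
        "Strutzen": [("s806_qdb-d1e43847b", "Struzn", "Rohrb. OÖ")],
    },
)
xml_text = ET.tostring(tree, encoding="unicode")
```

The headword form is written once per `<superEntry>` rather than once per `<entry>`, mirroring the removal of the hauptlemma from the individual entries in the example.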
14. Problems with <entryFree> model
• It is a hack!
• Current TEI guidelines and data model are
inherently and intentionally semasiological, and
this use of the vocabulary is only valid by chance,
not intention.
> Thus, using this data model within the TEI brings
none of the advantages that generally come with using the TEI
15. TBX-TEI Hybrid
Romary (2014):
Attempts to customize the TEI guidelines to incorporate TBX
(ISO 30042) terminological entries in order to provide TEI with an
onomasiological model
https://github.com/laurentromary/TBXinTEI
16. TBX-TEI Hybrid
<tbx:termEntry xmlns="http://www.tbx.org"><!-- @xml:id; -->
<descrip type="concept" target="concept:Wecken"/> <!-- sense not normally included in TBX! -->
<descrip type="domain" target="concept:Brot" xml:lang="de">Brot</descrip>
<descrip type="definition" xml:lang="en">Oblong loaf of bread</descrip>
<!-- no headword form may occur outside of <langSet> -->
<langSet xml:id="w834_qdb-d1e602" xml:lang="bar-x-smichael"><!-- language/dialect i) @xml:id; -->
<!-- No sense allowed! -->
<tei:note type="anmerkung" resp="O" corresp="#BD">deren Grundriß ein Oval ist</tei:note>
<!-- @corresp allowed in TEI <note> but not here -->
<!-- Most metadata elements are valid using <tei:ref> but are syntactically required to occur before <tig> -->
<admin type="geo">
<tei:placeName>St.Michael/B. Bgl.</tei:placeName>
</admin>
<tig><!-- <tei:form> would be better -->
<tei:term type="hauptlemma">Wecken</tei:term>
<termNote type="transcription">orth</termNote><!-- this is inefficient: need to allow <orth> & <pron> -->
<termNote type="pos">Subst</termNote><!-- this actually should be applicable to all forms (headword & lemmas) -->
</tig>
<tig>
<tei:term type="lautung" n="1">W.eiggn</tei:term>
<termNote type="transcription">pron</termNote>
<termNote type="notation">tustep</termNote><!-- we also need to allow @notation -->
</tig>
<tig><!-- TBX doesn't allow multiple instances of <term> in same <tig> as TEI does with <orth>,<pron> w/in <form> -->
<tei:term type="lautung" n="1" resp="#JB">ʋɛiggn̩</tei:term>
<termNote type="transcription" change="1.2">pron</termNote><!-- @change in original not allowed in hybrid schema -->
<termNote type="notation">ipa</termNote>
</tig>
</langSet>
….
17. Problems with TEI-TBX Hybrid model as
per the ODD Schema from Romary (2014)
• <tig> is verbose and would be better replaced with <form>
• the order of occurrence of elements is too restricted
• the TBX-dominated schema lacks many attributes (e.g.
@notation) and elements (e.g. <orth>, <pron>) that are key
to the storage and representation of lexical data as used in TEI
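The verbosity point can be made concrete with a small Python sketch that expands one TEI-style `<form>` into the `<tig>` elements the hybrid encoding needs for the same information. The function name `form_to_tigs` is hypothetical, namespaces are left out for brevity, and the mapping follows the pattern of the hybrid example shown earlier rather than any official conversion.

```python
import xml.etree.ElementTree as ET

def form_to_tigs(form):
    """Expand one TEI-style <form> into one TBX-style <tig> per <orth>/<pron> child."""
    tigs = []
    for child in form:
        if child.tag not in ("orth", "pron"):
            continue  # only the written and spoken form elements are converted
        tig = ET.Element("tig")
        term = ET.SubElement(tig, "term", type=form.get("type", ""))
        term.text = child.text
        # The element name (orth/pron) becomes a termNote value
        ET.SubElement(tig, "termNote", type="transcription").text = child.tag
        notation = child.get("notation")
        if notation is not None:
            # The @notation attribute also has to become a separate termNote
            ET.SubElement(tig, "termNote", type="notation").text = notation
        tigs.append(tig)
    return tigs

form = ET.fromstring(
    '<form type="lautung">'
    '<pron notation="tustep">W.eiggn</pron>'
    '<pron notation="ipa">ʋɛiggn̩</pron>'
    '</form>'
)
tigs = form_to_tigs(form)
tei_elements = sum(1 for _ in form.iter())
tbx_elements = sum(1 for tig in tigs for _ in tig.iter())
```

For this input, three TEI elements (one `<form>` with two `<pron>` children) become eight elements in the `<tig>`-based encoding, because every element name and attribute is restated as a `<termNote>`.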
18. Conclusion
(i) TEI lacks a legitimate means of encoding terminological/
onomasiological entries;
(ii) Given that we need to include sense (or a parallel equivalent) and
the headword at the top of an entry, a TBX-TEI hybrid doesn’t work
either without serious modification via ODD, mostly to introduce
elements and features from TEI, which stretches the traditional usage
of the system;
(iii) TEI needs to re-introduce a means of onomasiological data
representation (such as <termEntry>), but with an expanded set of
elements and attributes matching the degree of expressivity of the
Dictionary module