SlideShare a Scribd company logo
1 of 29
Download to read offline
Ivan Derzhanski, Natalia Kotsyba


     Towards a consistent
     morphological tagset
      for Slavic languages:
   Extending MULTEXT-East
     for Polish, Ukrainian
         and Belarusian
Ivan Derzhanski, Natalia Kotsyba

Towards a consistent morphological
   tagset for Slavic languages:
 Extending MULTEXT-East for
Polish, Ukrainian and Belarusian
    (… as well as Upper
    and Lower Sorbian)
The main idea
Morphosyntactic annotation is
needed by theoretical,
computational and applied
linguistics alike.
Morphosyntactic annotation
requires a tagset (ideally one
consistent with linguistic theory and
then with grammatical tradition).
The main idea (continued)

Comparative theoretical studies,
morphosyntactic annotation in
parallel corpora, bi- and
multilingual dictionary making
all require a common,
crosslinguistically consistent
tagset.
… that is, a tagset which

treats like phenomena in like ways,
treats unlike phenomena in unlike
ways,
reflects the structural, etymological
and semantic unity of grammatical
categories to the greatest extent
(especially in the case of closely
related languages).
MULTEXT-East
11 tagsets developed in v.3, with 3 more added in v.4:
  Indo-European
            Slavic
       −

                  East (1): RU
                  West (2): CS, SK
                  South (4): SL, SL-R, HR, SR,
                  BG, MK
            non-Slavic (3): EN, RO, FA
       −

  Uralic (2): HU, ET
MULTEXT-East: virtues

 Intended to be a multilanguage tagset
 from the beginning.
 Already de facto standard for several
 languages.
Thus a natural starting point for further
 work in this field.
MULTEXT-East:
same phenomenon, different treatment
  attributive participles
          verb forms (BG, RU),
      −

          adjectives (CS, SK);
      −

  adverbial participles
          V, Vform=gerund (BG),
      −

          V, Vform=transgressive (CS, SK),
      −

          R, Type=verbal|causal (HU).
      −
MULTEXT-East:
same phenomenon, different treatment
  virile (masculine human) forms of
  numerals
         BG двама: Form=m_form
     −

         SK dvaja: Form=letter
     −
MULTEXT-East:
similar phenomenon, different treatment
   short:long forms of adjectives
           Formation=nominal:compound (CS),
       −

           Definiteness=no:yes (SL),
       −

           Definiteness=short_art:full_art (RU),
       −

           ?? (BG).
       −
MULTEXT-East:
same term, different content
M, Type=multipl[icative]
        adverbial: dvakrát (CS),
    −

        adjectival: dvojen (SL)
    −


V, Definiteness: HU vs BG
N, Case=genitive: FA vs everything else
MULTEXT-East:
language-specific solutions
Clitic_s (CS)
extra cases (RU)
RU цвет чая ~ чашка чаю,
шишка в снегу ~ вдохновение в снеге;
UA муха в меді ~ зварено на меду,
краснопера (individual) ~ красноперу (species);
BE пераезда (place) ~ пераезду (act);
CS bratrovi ~ bratru Janovi
That national grammatical traditions
 have often been followed is
 understandable.
But comparative work requires a
 common theoretical ground, the lack
 of which defeats the purpose of a
 common tagset.
So some traditional propositions will
 have to be sacrificed.
Moreover, traditional grammar can be
            inconsistent.

Bulgarian:
           водата ми ‘my water, το νερό μου’
      −

           дай ми ‘give me, δώς μου’
      −



Slovak:          2nd singular    reflexive
personal         tebou           sebou
possessive       tvoj            svoj
Priorities for a pan-Slavic tagset

  crosslinguistic consistency,
  linguistic adequacy,
  compactness.
Agglutination

A mechanism definitely needed for the
 floating copula (as well as other clitics) in
 Polish.
          powinniście znać
      −

          słyszeliście
      −

          gdzieżeście słyszeli …?
      −

          czybyście uwierzyli …?
      −

          po coście mnie tu przyniosły?
      −
Agglutination (continued)

Will also do for:
        Czech
    −

             floating -s < jsi (currently Clitic_s for
             verbs and pronouns only),
             aby, kdeby + -ch(om), -s(te) (currently
             inflecting particles);
        Czech/Slovak -že;
    −

        adposition+pronoun compounds:
    −
        PL przezeń, HS tohodla, etc.
Additional features needed

N, V, A, P, M: Virile
(for SK, PL, UA, BE, HS, DS, BG)
N: CaseForm (first, second)
(currently Case2=p|l)
V: Agglutinativity, Vocalicity
Additional features needed

A: Voice, Negation (for participles)
A: Owner_Gender (for Sorbian)
HS stareje žoniny syn
DS našogo nanowe crjeje
P: Post-prepositional (by any name)
HS jón ~ njón, što ~ čo
RU ниже них ~ ниже их
S: Vocalicity
Additional values needed

N|Type: gerund (for Polish at least)
        Aspect, Negation
    −

N|Gender: common
V|Aspect: biaspectual
V|Person: inclusive (for Russian)
A|Type: participle (etc.), pre-adjectival
P|Person: reflexive
Conversion of existing formats for Polish
  and Ukrainian to an MTE-like format
Resources for morphological processing of Polish and Ukrainian have been
developed independently from the project MTE in Poland (IPI—PAS corpus)
and Ukraine (ULIF NASU corpus), respectively.
Morphological information is encoded in the form of grammatical
dictionaries that allow for both analysing and synthesising word forms.
The granulation of grammatical information there and the formats of
recording it differ considerably from the core MTE tagset.
Grammatical categories and values overlap (are one-to-one relations) only in
part; some of them have to be decomposed into finer ones, and new
categories/values need to be assigned to all relevant lexemes in a
grammatical dictionary.
On the other hand, grammatical dictionaries contain information that is not
necessary for MTE-like tagging.
Conversion of existing formats for Polish
 and Ukrainian to an MTE-like format
Two possible levels of introducing changes into Polish and
Ukrainian grammatical sources: level of conversion of tagged
texts, or directly in the dictionary source files.
Polish source files are not available for processing and
development.
Ukrainian: additional grouping of lexemes is done within
UGTag, morphological tagger with the possibility of adding
new words from tagged texts, unrecognised by the tagger.
One possible output format of UGTag will be an MTE-like
tagged text.
Belarusian: a grammatical dictionary is under development
now on the basis of an extensive orthographic dictionary;
suggestions concerning its design and compatibility with
MTE-like tagging format can be taken into account, no further
conversion will be required.
Conversion of existing formats for Polish
 and Ukrainian to an MTE-like format
 The tagsets for Polish (IPIC) and Ukrainian (UGD) were
 brought together within the PolUKR project with the aim
 of creating a common tagset for the parallel corpus of
 those languages.
 The criterion of minimal information loss was used,
 although the common tagset is not a pure arithmetic sum
 of the two tagsets.
 it was based on the pattern of IPIC, as it was easier this
 way to adjust the search program Poliqarp for the needs of
 PolUKR.
 Since MTE-like tagging is becoming a standard now, it
 was decided to bring the PolUKR tagset to conformity
 with it.
Fragment of the conversion table IPIC/PolUKR
     → MTE v.3/4 (111 dictionary positions):
English term                        PolUKR     MTE tag (fragment)        example
                                    tag
                                    qub        Q                         niech
particle-adverb
                                    dsc        Q                         властиво
discourse markers
                                    inf        V, VForm=n                спатоньки
infinitive
                                    imps       V, VForm=t                rozpoczęto, robiono
impersonal form
adverbial participle                part       V, VForm=r

simultaneous adverbial participle   pcon       V, VForm=r, Tense=p       роблячи, robiąc

anterior adverbial participle       pant       V, VForm=r, Tense=a,      зробивши, zrobiwszy
                                               Aspect=e
simultaneous past participle        ppast      V, VForm=r, Tense=a,      робивши, *robiwszy (rare)
                                               Aspect=p
common (general) noun               gnoun      N, Type=c                 шахи
proper name                         propnoun   N, Type=p                 Сколе
disparaging (depreciative) noun     depr       N, Animate=y, Human=n     profesory

1st- or 2nd-person pro-noun         ppron12    P, Type=p, Person=(1|2)   я, ти

gerund                              ger        N, Type=g                 robienie, nierobienie
                                                                         niezrobienie
3rd-person pro-noun                 ppron3     P, Type=p, Person=3       він, вони
And a fragment of the correspondence table
MTE v.3/4 → IPIC/PolUKR (332 positions):

             attribute value value name IPIC/PolUKR equivalent
category
                       code
Adjective(A) Aspect    E     perfective  (pact|pass)&aspect=perfective
                             progressive (pact|pass)&aspect=imperfective
Adjective(A) Aspect    P

Adjective(A) Voice     A     active       pact&aspect=perfective
Adjective(A) Voice     P     passive      pass&aspect=perfective
Adverb (R)             R                  adv|adjp|pred

Verb(V)      VForm     I     indicative   fin|praet|bedzie
Verb(V)      Tense     P     present      fin&aspect=imperf
Verb(V)      Tense     F     future       bedzie|(fin&aspect=perf)
A fragment of the XML specification file for
   Ukrainian compatible with the MTE-4
           proposal for Russian:

   <row role=quot;attributequot;>
     <cell xml:lang=quot;enquot; role=quot;positionquot;>6</cell>
     <cell role=quot;namequot; xml:lang=quot;enquot;>Case2</cell>
     <cell xml:lang=quot;enquot; role=quot;valuesquot;>
       <table>
         <row role=quot;valuequot;>
            <cell role=quot;namequot; xml:lang=quot;enquot;>genitive</cell>
            <cell role=quot;codequot; xml:lang=quot;enquot;>g</cell>
         </row>
          <row role=quot;valuequot;>
            <cell role=quot;namequot; xml:lang=quot;enquot;>dative</cell>
            <cell role=quot;codequot; xml:lang=quot;enquot;>d</cell>
         </row>
          <row role=quot;valuequot;>
            <cell role=quot;namequot; xml:lang=quot;enquot;>locative</cell>
            <cell role=quot;codequot; xml:lang=quot;enquot;>l</cell>
          </row>
       </table>
     </cell>
   </row>
The same fragment for Ukrainian
  according to our proposals:
     <row role=quot;attributequot;>
       <cell xml:lang=quot;enquot; role=quot;positionquot;>6</cell>
       <cell role=quot;namequot; xml:lang=quot;enquot;>CaseForm</cell>
       <cell xml:lang=quot;enquot; role=quot;valuesquot;>
         <table>
            <row role=quot;valuequot;>
              <cell role=quot;namequot; xml:lang=quot;enquot;>first</cell>
              <cell role=quot;codequot; xml:lang=quot;enquot;>1</cell>
           </row>
            <row role=quot;valuequot;>
              <cell role=quot;namequot; xml:lang=quot;enquot;>second</cell>
              <cell role=quot;codequot; xml:lang=quot;enquot;>2</cell>
            </row>
            <row role=quot;valuequot;>
              <cell role=quot;namequot; xml:lang=quot;enquot;>third</cell>
              <cell role=quot;codequot; xml:lang=quot;enquot;>3</cell>
            </row>
         </table>
       </cell>
     </row>
Conclusions and recommendations
General agreement on the tagset to be achieved among its
developers; a common ground must be found.
In its current state the MTE tagset includes information
from different levels of language description: purely
morphological, derivational, syntactic and semantic.
Syntactic and semantic analysis and tagging are further
necessary steps in language description, and principles of
tagging for them should be developed.
The layer of derivation is significant for (semi)automatic
lexicon development.
Information currently encoded about levels other than the
morphological one (such as valency for prepositions or
classification of pronoun types) should also be
redistributed in the future.
Дякуємо за увагу             Благодарим за внимание

Дзякуем за ўвагу             Благодарим за вниманието

Dziękujemy za uwagę          Благодараме за вниманието

                             Bug lunej za to atencjun
Dźakujemoj so za kedźbnosć

Źěkujomej se za zajmowanosć Mulţumim pentru atenţie

Ďakujemе za pozornosť Thank you for your attention
Děkujeme za pozornost Täname tähelepanu eest
                             Köszönjük a figyelmet
Zahvaliva se za pozornost

                               ‫از دقتتان سپاسگزاريم‬
Zahvaljujeme na pažnju

More Related Content

Viewers also liked

Who Wants To Be A Millionaire
Who Wants To Be A MillionaireWho Wants To Be A Millionaire
Who Wants To Be A MillionaireAgata Klimczak
 
[Content Creation]: 21 Types of Content
[Content Creation]: 21 Types of Content[Content Creation]: 21 Types of Content
[Content Creation]: 21 Types of ContentLiz Azyan
 
From LARP'ing to Libraries
From LARP'ing to LibrariesFrom LARP'ing to Libraries
From LARP'ing to LibrariesDaniel Chilla
 
120322 foredrag soc med o ungdommar daniel chilla
120322 foredrag soc med o ungdommar daniel chilla120322 foredrag soc med o ungdommar daniel chilla
120322 foredrag soc med o ungdommar daniel chillaDaniel Chilla
 
Some American And Some British Buildings And Bridges
Some American And Some British Buildings And BridgesSome American And Some British Buildings And Bridges
Some American And Some British Buildings And BridgesAgata Klimczak
 
Engineering Careers Ee D214 11 09 Pptcopy Pdf
Engineering Careers Ee D214 11 09 Pptcopy PdfEngineering Careers Ee D214 11 09 Pptcopy Pdf
Engineering Careers Ee D214 11 09 Pptcopy PdfRFjock
 
Engineer Week2010 Dupage Expo1
Engineer Week2010 Dupage Expo1Engineer Week2010 Dupage Expo1
Engineer Week2010 Dupage Expo1RFjock
 
IEEE_EMCS_DL JDM LABS LLC 1i Pdf
IEEE_EMCS_DL JDM LABS LLC 1i PdfIEEE_EMCS_DL JDM LABS LLC 1i Pdf
IEEE_EMCS_DL JDM LABS LLC 1i PdfRFjock
 
The story as baseline: Timeline, budget och delivarables in transmediaprojekt
The story as baseline: Timeline, budget och delivarables in transmediaprojektThe story as baseline: Timeline, budget och delivarables in transmediaprojekt
The story as baseline: Timeline, budget och delivarables in transmediaprojektDaniel Chilla
 
Class Of 1961 W Music
Class Of 1961 W MusicClass Of 1961 W Music
Class Of 1961 W Musicarpysaunders
 

Viewers also liked (17)

Am And Brit Art
Am And Brit ArtAm And Brit Art
Am And Brit Art
 
Bg-Uk time words
Bg-Uk time wordsBg-Uk time words
Bg-Uk time words
 
Who Wants To Be A Millionaire
Who Wants To Be A MillionaireWho Wants To Be A Millionaire
Who Wants To Be A Millionaire
 
[Content Creation]: 21 Types of Content
[Content Creation]: 21 Types of Content[Content Creation]: 21 Types of Content
[Content Creation]: 21 Types of Content
 
From LARP'ing to Libraries
From LARP'ing to LibrariesFrom LARP'ing to Libraries
From LARP'ing to Libraries
 
Geek Review
Geek ReviewGeek Review
Geek Review
 
120322 foredrag soc med o ungdommar daniel chilla
120322 foredrag soc med o ungdommar daniel chilla120322 foredrag soc med o ungdommar daniel chilla
120322 foredrag soc med o ungdommar daniel chilla
 
Some American And Some British Buildings And Bridges
Some American And Some British Buildings And BridgesSome American And Some British Buildings And Bridges
Some American And Some British Buildings And Bridges
 
Bg-Uk brief time words
Bg-Uk brief time wordsBg-Uk brief time words
Bg-Uk brief time words
 
Engineering Careers Ee D214 11 09 Pptcopy Pdf
Engineering Careers Ee D214 11 09 Pptcopy PdfEngineering Careers Ee D214 11 09 Pptcopy Pdf
Engineering Careers Ee D214 11 09 Pptcopy Pdf
 
Some Questions
Some QuestionsSome Questions
Some Questions
 
Engineer Week2010 Dupage Expo1
Engineer Week2010 Dupage Expo1Engineer Week2010 Dupage Expo1
Engineer Week2010 Dupage Expo1
 
IEEE_EMCS_DL JDM LABS LLC 1i Pdf
IEEE_EMCS_DL JDM LABS LLC 1i PdfIEEE_EMCS_DL JDM LABS LLC 1i Pdf
IEEE_EMCS_DL JDM LABS LLC 1i Pdf
 
Usa Geography Quiz
Usa Geography QuizUsa Geography Quiz
Usa Geography Quiz
 
Politics Uk
Politics UkPolitics Uk
Politics Uk
 
The story as baseline: Timeline, budget och delivarables in transmediaprojekt
The story as baseline: Timeline, budget och delivarables in transmediaprojektThe story as baseline: Timeline, budget och delivarables in transmediaprojekt
The story as baseline: Timeline, budget och delivarables in transmediaprojekt
 
Class Of 1961 W Music
Class Of 1961 W MusicClass Of 1961 W Music
Class Of 1961 W Music
 

Similar to Towards a consistent morphological tagset for Slavic languages: Extending MULTEXT-East for Polish, Ukrainian and Belarusian

Алексей Чеусов - Расчёсываем своё ЧСВ
Алексей Чеусов - Расчёсываем своё ЧСВАлексей Чеусов - Расчёсываем своё ЧСВ
Алексей Чеусов - Расчёсываем своё ЧСВMinsk Linux User Group
 
The verbal phrase of Northern Sotho: A morpho-syntactic perspective
The verbal phrase of Northern Sotho: A morpho-syntactic perspectiveThe verbal phrase of Northern Sotho: A morpho-syntactic perspective
The verbal phrase of Northern Sotho: A morpho-syntactic perspectiveGuy De Pauw
 
English as a Germanic Language
English as a Germanic LanguageEnglish as a Germanic Language
English as a Germanic LanguageJano Avila Moncada
 
Nominal Collectivity, S-Mass & O-Mass: Syntax, Semantics, and Phonology Inter...
Nominal Collectivity, S-Mass & O-Mass: Syntax, Semantics, and Phonology Inter...Nominal Collectivity, S-Mass & O-Mass: Syntax, Semantics, and Phonology Inter...
Nominal Collectivity, S-Mass & O-Mass: Syntax, Semantics, and Phonology Inter...mmitrovic2
 
Documenting and modeling inflectional paradigms in under-resourced languages
Documenting and modeling inflectional paradigms in under-resourced languages Documenting and modeling inflectional paradigms in under-resourced languages
Documenting and modeling inflectional paradigms in under-resourced languages Katerina Vylomova
 
Lanugage Project
Lanugage  ProjectLanugage  Project
Lanugage ProjectMichelene
 
Prepositions in Czech
Prepositions in CzechPrepositions in Czech
Prepositions in CzechDominik Lukes
 
Lecture Notes-Finite State Automata for NLP.pdf
Lecture Notes-Finite State Automata for NLP.pdfLecture Notes-Finite State Automata for NLP.pdf
Lecture Notes-Finite State Automata for NLP.pdfDeptii Chaudhari
 
Teaching alphabetics and fluency in reading
Teaching alphabetics and fluency in readingTeaching alphabetics and fluency in reading
Teaching alphabetics and fluency in readingMarcia Luptak
 
A Basic Modern Russian Grammar
A Basic Modern Russian Grammar A Basic Modern Russian Grammar
A Basic Modern Russian Grammar Mahmoud
 
2.26 2.27- translating 9
2.26 2.27- translating 92.26 2.27- translating 9
2.26 2.27- translating 9jennarenee153
 
A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and R...
A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and R...A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and R...
A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and R...Svetlin Nakov
 
Word level language identification in code-switched texts
Word level language identification in code-switched textsWord level language identification in code-switched texts
Word level language identification in code-switched textsHarsh Jhamtani
 
My Version
My Version My Version
My Version ktolbert
 

Similar to Towards a consistent morphological tagset for Slavic languages: Extending MULTEXT-East for Polish, Ukrainian and Belarusian (20)

Алексей Чеусов - Расчёсываем своё ЧСВ
Алексей Чеусов - Расчёсываем своё ЧСВАлексей Чеусов - Расчёсываем своё ЧСВ
Алексей Чеусов - Расчёсываем своё ЧСВ
 
The verbal phrase of Northern Sotho: A morpho-syntactic perspective
The verbal phrase of Northern Sotho: A morpho-syntactic perspectiveThe verbal phrase of Northern Sotho: A morpho-syntactic perspective
The verbal phrase of Northern Sotho: A morpho-syntactic perspective
 
English as a Germanic Language
English as a Germanic LanguageEnglish as a Germanic Language
English as a Germanic Language
 
56889.ppt
56889.ppt56889.ppt
56889.ppt
 
Lidia Pivovarova
Lidia PivovarovaLidia Pivovarova
Lidia Pivovarova
 
Nominal Collectivity, S-Mass & O-Mass: Syntax, Semantics, and Phonology Inter...
Nominal Collectivity, S-Mass & O-Mass: Syntax, Semantics, and Phonology Inter...Nominal Collectivity, S-Mass & O-Mass: Syntax, Semantics, and Phonology Inter...
Nominal Collectivity, S-Mass & O-Mass: Syntax, Semantics, and Phonology Inter...
 
Documenting and modeling inflectional paradigms in under-resourced languages
Documenting and modeling inflectional paradigms in under-resourced languages Documenting and modeling inflectional paradigms in under-resourced languages
Documenting and modeling inflectional paradigms in under-resourced languages
 
Lanugage Project
Lanugage  ProjectLanugage  Project
Lanugage Project
 
Syntax
SyntaxSyntax
Syntax
 
Prepositions in Czech
Prepositions in CzechPrepositions in Czech
Prepositions in Czech
 
Lecture Notes-Finite State Automata for NLP.pdf
Lecture Notes-Finite State Automata for NLP.pdfLecture Notes-Finite State Automata for NLP.pdf
Lecture Notes-Finite State Automata for NLP.pdf
 
Teaching alphabetics and fluency in reading
Teaching alphabetics and fluency in readingTeaching alphabetics and fluency in reading
Teaching alphabetics and fluency in reading
 
A Basic Modern Russian Grammar
A Basic Modern Russian Grammar A Basic Modern Russian Grammar
A Basic Modern Russian Grammar
 
2.26 2.27- translating 9
2.26 2.27- translating 92.26 2.27- translating 9
2.26 2.27- translating 9
 
Lição nº1
Lição nº1Lição nº1
Lição nº1
 
A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and R...
A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and R...A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and R...
A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and R...
 
Word level language identification in code-switched texts
Word level language identification in code-switched textsWord level language identification in code-switched texts
Word level language identification in code-switched texts
 
2 allomorphy
2 allomorphy2 allomorphy
2 allomorphy
 
2-lecture_01.pdf
2-lecture_01.pdf2-lecture_01.pdf
2-lecture_01.pdf
 
My Version
My Version My Version
My Version
 

Recently uploaded

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 

Recently uploaded (20)

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 

Towards a consistent morphological tagset for Slavic languages: Extending MULTEXT-East for Polish, Ukrainian and Belarusian

  • 1. Ivan Derzhanski, Natalia Kotsyba Towards a consistent morphological tagset for Slavic languages: Extending MULTEXT-East for Polish, Ukrainian and Belarusian
  • 2. Ivan Derzhanski, Natalia Kotsyba Towards a consistent morphological tagset for Slavic languages: Extending MULTEXT-East for Polish, Ukrainian and Belarusian (… as well as Upper and Lower Sorbian)
  • 3. The main idea Morphosyntactic annotation is needed by theoretical, computational and applied linguistics alike. Morphosyntactic annotation requires a tagset (ideally one consistent with linguistic theory and then with grammatical tradition).
  • 4. The main idea (continued) Comparative theoretical studies, morphosyntactic annotation in parallel corpora, bi- and multilingual dictionary making all require a common, crosslinguistically consistent tagset.
  • 5. … that is, a tagset which treats like phenomena in like ways, treats unlike phenomena in unlike ways, reflects the structural, etymological and semantic unity of grammatical categories to the greatest extent (especially in the case of closely related languages).
  • 6. MULTEXT-East 11 tagsets developed in v.3, with 3 more added in v.4: Indo-European Slavic − East (1): RU West (2): CS, SK South (4): SL, SL-R, HR, SR, BG, MK non-Slavic (3): EN, RO, FA − Uralic (2): HU, ET
  • 7. MULTEXT-East: virtues Intended to be a multilanguage tagset from the beginning. Already de facto standard for several languages. Thus a natural starting point for further work in this field.
  • 8. MULTEXT-East: same phenomenon, different treatment attributive participles verb forms (BG, RU), − adjectives (CS, SK); − adverbial participles V, Vform=gerund (BG), − V, Vform=transgressive (CS, SK), − R, Type=verbal|causal (HU). −
  • 9. MULTEXT-East: same phenomenon, different treatment virile (masculine human) forms of numerals BG двама: Form=m_form − SK dvaja: Form=letter −
  • 10. MULTEXT-East: similar phenomenon, different treatment short:long forms of adjectives Formation=nominal:compound (CS), − Definiteness=no:yes (SL), − Definiteness=short_art:full_art (RU), − ?? (BG). −
  • 11. MULTEXT-East: same term, different content M, Type=multipl[icative] adverbial: dvakrát (CS), − adjectival: dvojen (SL) − V, Definiteness: HU vs BG N, Case=genitive: FA vs everything else
  • 12. MULTEXT-East: language-specific solutions Clitic_s (CS) extra cases (RU) RU цвет чая ~ чашка чаю, шишка в снегу ~ вдохновение в снеге; UA муха в меді ~ зварено на меду, краснопера (individual) ~ красноперу (species); BE пераезда (place) ~ пераезду (act); CS bratrovi ~ bratru Janovi
  • 13. That national grammatical traditions have often been followed is understandable. But comparative work requires a common theoretical ground, the lack of which defeats the purpose of a common tagset. So some traditional propositions will have to be sacrificed.
  • 14. Moreover, traditional grammar can be inconsistent. Bulgarian: водата ми ‘my water, το νερό μου’ − дай ми ‘give me, δώς μου’ − Slovak: 2nd singular reflexive personal tebou sebou possessive tvoj svoj
  • 15. Priorities for a pan-Slavic tagset crosslinguistic consistency, linguistic adequacy, compactness.
  • 16. Agglutination A mechanism definitely needed for the floating copula (as well as other clitics) in Polish. powinniście znać − słyszeliście − gdzieżeście słyszeli …? − czybyście uwierzyli …? − po coście mnie tu przyniosły? −
  • 17. Agglutination (continued) Will also do for: Czech − floating -s < jsi (currently Clitic_s for verbs and pronouns only), aby, kdeby + -ch(om), -s(te) (currently inflecting particles); Czech/Slovak -že; − adposition+pronoun compounds: − PL przezeń, HS tohodla, etc.
  • 18. Additional features needed N, V, A, P, M: Virile (for SK, PL, UA, BE, HS, DS, BG) N: CaseForm (first, second) (currently Case2=p|l) V: Agglutinativity, Vocalicity
  • 19. Additional features needed A: Voice, Negation (for participles) A: Owner_Gender (for Sorbian) HS stareje žoniny syn DS našogo nanowe crjeje P: Post-prepositional (by any name) HS jón ~ njón, što ~ čo RU ниже них ~ ниже их S: Vocalicity
  • 20. Additional values needed N|Type: gerund (for Polish at least) Aspect, Negation − N|Gender: common V|Aspect: biaspectual V|Person: inclusive (for Russian) A|Type: participle (etc.), pre-adjectival P|Person: reflexive
  • 21. Conversion of existing formats for Polish and Ukrainian to an MTE-like format Resources for morphological processing of Polish and Ukrainian have been developed independently from the project MTE in Poland (IPI—PAS corpus) and Ukraine (ULIF NASU corpus), respectively. Morphological information is encoded in the form of grammatical dictionaries that allow for both analysing and synthesising word forms. The granulation of grammatical information there and the formats of recording it differ considerably from the core MTE tagset. Grammatical categories and values overlap (are one-to-one relations) only in part; some of them have to be decomposed into finer ones, and new categories/values need to be assigned to all relevant lexemes in a grammatical dictionary. On the other hand, grammatical dictionaries contain information that is not necessary for MTE-like tagging.
  • 22. Conversion of existing formats for Polish and Ukrainian to an MTE-like format Two possible levels of introducing changes into Polish and Ukrainian grammatical sources: level of conversion of tagged texts, or directly in the dictionary source files. Polish source files are not available for processing and development. Ukrainian: additional grouping of lexemes is done within UGTag, morphological tagger with the possibility of adding new words from tagged texts, unrecognised by the tagger. One possible output format of UGTag will be an MTE-like tagged text. Belarusian: a grammatical dictionary is under development now on the basis of an extensive orthographic dictionary; suggestions concerning its design and compatibility with MTE-like tagging format can be taken into account, no further conversion will be required.
  • 23. Conversion of existing formats for Polish and Ukrainian to an MTE-like format The tagsets for Polish (IPIC) and Ukrainian (UGD) were brought together within the PolUKR project with the aim of creating a common tagset for the parallel corpus of those languages. The criterion of minimal information loss was used, although the common tagset is not a pure arithmetic sum of the two tagsets. it was based on the pattern of IPIC, as it was easier this way to adjust the search program Poliqarp for the needs of PolUKR. Since MTE-like tagging is becoming a standard now, it was decided to bring the PolUKR tagset to conformity with it.
  • 24. Fragment of the conversion table IPIC/PolUKR → MTE v.3/4 (111 dictionary positions): English term PolUKR MTE tag (fragment) example tag qub Q niech particle-adverb dsc Q властиво discourse markers inf V, VForm=n спатоньки infinitive imps V, VForm=t rozpoczęto, robiono impersonal form adverbial participle part V, VForm=r simultaneous adverbial participle pcon V, VForm=r, Tense=p роблячи, robiąc anterior adverbial participle pant V, VForm=r, Tense=a, зробивши, zrobiwszy Aspect=e simultaneous past participle ppast V, VForm=r, Tense=a, робивши, *robiwszy (rare) Aspect=p common (general) noun gnoun N, Type=c шахи proper name propnoun N, Type=p Сколе disparaging (depreciative) noun depr N, Animate=y, Human=n profesory 1st- or 2nd-person pro-noun ppron12 P, Type=p, Person=(1|2) я, ти gerund ger N, Type=g robienie, nierobienie niezrobienie 3rd-person pro-noun ppron3 P, Type=p, Person=3 він, вони
  • 25. And a fragment of the correspondence table MTE v.3/4 → IPIC/PolUKR (332 positions): attribute value value name IPIC/PolUKR equivalent category code Adjective(A) Aspect E perfective (pact|pass)&aspect=perfective progressive (pact|pass)&aspect=imperfective Adjective(A) Aspect P Adjective(A) Voice A active pact&aspect=perfective Adjective(A) Voice P passive pass&aspect=perfective Adverb (R) R adv|adjp|pred Verb(V) VForm I indicative fin|praet|bedzie Verb(V) Tense P present fin&aspect=imperf Verb(V) Tense F future bedzie|(fin&aspect=perf)
  • 26. A fragment of the XML specification file for Ukrainian compatible with the MTE-4 proposal for Russian: <row role=quot;attributequot;> <cell xml:lang=quot;enquot; role=quot;positionquot;>6</cell> <cell role=quot;namequot; xml:lang=quot;enquot;>Case2</cell> <cell xml:lang=quot;enquot; role=quot;valuesquot;> <table> <row role=quot;valuequot;> <cell role=quot;namequot; xml:lang=quot;enquot;>genitive</cell> <cell role=quot;codequot; xml:lang=quot;enquot;>g</cell> </row> <row role=quot;valuequot;> <cell role=quot;namequot; xml:lang=quot;enquot;>dative</cell> <cell role=quot;codequot; xml:lang=quot;enquot;>d</cell> </row> <row role=quot;valuequot;> <cell role=quot;namequot; xml:lang=quot;enquot;>locative</cell> <cell role=quot;codequot; xml:lang=quot;enquot;>l</cell> </row> </table> </cell> </row>
  • 27. The same fragment for Ukrainian according to our proposals: <row role=quot;attributequot;> <cell xml:lang=quot;enquot; role=quot;positionquot;>6</cell> <cell role=quot;namequot; xml:lang=quot;enquot;>CaseForm</cell> <cell xml:lang=quot;enquot; role=quot;valuesquot;> <table> <row role=quot;valuequot;> <cell role=quot;namequot; xml:lang=quot;enquot;>first</cell> <cell role=quot;codequot; xml:lang=quot;enquot;>1</cell> </row> <row role=quot;valuequot;> <cell role=quot;namequot; xml:lang=quot;enquot;>second</cell> <cell role=quot;codequot; xml:lang=quot;enquot;>2</cell> </row> <row role=quot;valuequot;> <cell role=quot;namequot; xml:lang=quot;enquot;>third</cell> <cell role=quot;codequot; xml:lang=quot;enquot;>3</cell> </row> </table> </cell> </row>
  • 28. Conclusions and recommendations General agreement on the tagset to be achieved among its developers; a common ground must be found. In its current state the MTE tagset includes information from different levels of language description: purely morphological, derivational, syntactic and semantic. Syntactic and semantic analysis and tagging are further necessary steps in language description, and principles of tagging for them should be developed. The layer of derivation is significant for (semi)automatic lexicon development. Information currently encoded about levels other than the morphological one (such as valency for prepositions or classification of pronoun types) should also be redistributed in the future.
  • 29. Дякуємо за увагу Благодарим за внимание Дзякуем за ўвагу Благодарим за вниманието Dziękujemy za uwagę Благодараме за вниманието Bug lunej za to atencjun Dźakujemoj so za kedźbnosć Źěkujomej se za zajmowanosć Mulţumim pentru atenţie Ďakujemе za pozornosť Thank you for your attention Děkujeme za pozornost Täname tähelepanu eest Köszönjük a figyelmet Zahvaliva se za pozornost ‫از دقتتان سپاسگزاريم‬ Zahvaljujeme na pažnju