In this talk, I will present the UniMorph project, an attempt to create a universal (cross-lingual) annotation schema. UniMorph allows an inflected word from any language to be defined by its lexical meaning, typically carried by the lemma, and a bundle of universal morphological features defined by the schema. Since 2016, the UniMorph database has been gradually developed and updated with new languages, and SIGMORPHON shared tasks served as a platform to compare computational models of inflectional morphology. During 2016–2021, the shared tasks made it possible to explore the data-driven systems’ ability to learn declension and conjugation paradigms as well as to evaluate how well they generalize across typologically diverse languages. It is especially important, since elaboration of formal techniques of cross-language generalization and prediction of universal entities across related languages should provide a new potential to the modeling and documentation of under-resourced languages. I will outline the major challenges we faced while converting the language-specific features into the UniMorph schema, especially in under-resourced languages. In addition, I will discuss typical errors made by the majority of the systems, e.g. incorrectly predicted instances due to allomorphy, form variation, misspelled words, looping effects. Finally, I will provide case studies for Russian, Tibetan, and Nen.
Documenting and modeling inflectional paradigms in under-resourced languages
1. Documenting and Modeling
Inflectional Paradigms in
Under-resourced Languages
By Ekaterina Vylomova
evylomova@gmail.com
Cardamom Seminar Series
Feb, 1 2022
4. Approx. 7,000* languages in the world…
* Language vs. dialect distinction: “A language is a dialect with its own army and navy” (Max
Weinreich)
5. Roman Jakobson on differences
between Languages
``Languages differ essentially in what they
must convey and not in what they may
convey''
6. Chinese (Isolating; strict word order)
wǒmen xué le zhè xiē shēngcí.
I.PL.AN learn .PAST this .PL new word.
``We learned these new words.''
7. Russian (Synthetic;flexible word order)
My vyučili eti novyje slova.
We learn.PAST.PL this.ACC.PL new.ACC.PL word.ACC.PL
``We learned these new words.''
10. Approx. 7,000* languages in the world…
… But majority of NLP technologies only focus on most documented
languages such as English, French, German, Russian, Hindi, or
Finnish
* Language vs. dialect distinction: “A language is a dialect with its own army and navy” (Max
Weinreich)
11. Most NLP is Standard Average European
… According to Martin Haspelmath (2001), “euroversals” share:
● definite and indefinite articles (e.g. English the vs. a)
● a periphrastic perfect formed with 'have' plus a passive participle (e.g. English I
have said);
● the verb is inflected for person and number of the subject, but subject
pronouns may not be dropped even when this would be unambiguous
● Some features that are common in European langs but also found elsewhere:
○ lack of distinction between inclusive and exclusive first-person plural pronouns ("we
and you" vs. "we and not you")
○ lack of distinction between alienable and inalienable (e.g. body part) possession;
SAE was introduced by Whorf, 1939
12. Most NLP is Standard Average European
Morphology: Position of Case Affixes (WALS 51A)
SAE was introduced by Whorf, 1939
Suffixing morphology: Eurasian
and Australian languages;
Prefixation: Mesoamerican
languages and African languages
spoken below the Sahara
13. Most NLP is Standard Average European
Morphology: Inclusive/Exclusive Distinction in Independent Pronouns (WALS 39A)
SAE was introduced by Whorf, 1939
No Incl.excl.: Eurasian
Incl/excl.: Australian and South
American languages and;
14. Most NLP is Standard Average European
Morphology: Possessive Classification (WALS 59A)
SAE was introduced by Whorf, 1939
No Possessive: Eurasian
Possessive: Australian, African,
American;
15. Most NLP is Standard Average European
Morphology: The Past Tense (WALS 66A)
SAE was introduced by Whorf, 1939
Past tense, 1: Eurasian
No past tense: South-East Asia,
African, American;
17. Wiktionary: > 8000 language codes (incl.
historic)
https://en.wiktionary.org/wiki/Wiktionary:List_of_languages
18. Wiktionary: language-specific
беглец (“male escapee/runaway”) + pos=N, case=ACC, number=SG → беглеца
Functionally similar feature
names (cases, numbers, etc )
differ across languages!
Highly depends on the
descriptive tradition used to
document/describe a
language!
19. UniMorph: Universal Morphology
1) 23 dimensions of meaning (TAM, case, number, animacy), 212
features
2) A-morphous morphology (Anderson, 1992)
3) Paradigms extracted from English Wiktionary (Kirov et al., 2016)
https://unimorph.github.io/doc/unimorph-schema.pdf
By John Sylak-Glassman, 2015
25. Case Systems:
Non-Core
Semblative (like smth, e.g.
English “human-like”;
Evenki) → COMPV
Abessive (English “-less”,
“without” ; Uralic) → PRIV
Causal case (“because of
this…”; Moksha) → LGSPEC
Prepositional case
(location, “about *”; Slavic)
→ ESS
XXXX
26. Case Systems:
Local
Uralic:
Inessive (“in”) → IN+ESS
Elative (“out of”) → IN+ABL
Illative (“into”) → IN +ALL
Allative (“onto”) → AT+ALL
Essive → FRML
Pl– Place; Dst – Distal; Mot – Motion; Asp –
Aspect
27. Case Systems: challenges in UniMorph
Cases (and other features) combinations:
Evenki: Accusative reflexive-genitive (ACC+PSSRS); Accusative Definite (ACC+DEF)/Indefinite
(ACC+INDF)
Case compounding:
Uralic(e.g. Beserman Udmurt;
from Maria Usacheva)
Case stacking:
Kayardild (up to 6 cases!):
Also in Chukchi, Sumerian, Basque, and others
29. A language may use different scripts over
time
… and it’s often a political question!
1. E.g., USSR invented many writing
systems for various indigenous
peoples that didn’t have their own
system
2. Still, some republics have ongoing
debates on the system they
should use (e.g., Latin- vs.
Cyrillic-based)
30. A language may use different scripts over
timeLanguage Arabic Latin Cyrillic Other
Bashkir < end of 1920s 1930s 1940–
Kazakh < end of 1920s 1930s 1940–
Tatar < end of 1920s 1930s;2000s 1940–
Uzbek < end of 1920s 1930s; 1990– 1940–
Uyghur < end of 1920s; but still
used in China
1930–1940s 1946–
Buryat 1930s End of 1930s– <1920s: Mongolian
Chukchi 1930s 1937– Tenevil
Udmurt 1930s end of 18th cent. —
Evenki 1930s 1937– Mongolian (China)
31. A language may use different scripts over
timeLanguage Arabic-based Latin-based Cyrillic-based Other
Bashkir < end of 1920s 1930s 1940–
Kazakh < end of 1920s 1930s 1940–
Tatar < end of 1920s 1930s;2000s 1940–
Uzbek < end of 1920s 1930s; 1990– 1940–
Uyghur < end of 1920s; but still
used in China
1930–1940s 1946–
Buryat 1930s End of 1930s– <1920s: Mongolian
Chukchi 1930s 1937– Tenevil
Udmurt 1930s end of 18th cent. —
Evenki 1930s 1937– Mongolian (China)
May consider all existing scripts and data
but better to check with native speakers
which one would be preferred !
33. SIGMORPHON Shared Task on
Morphological (Re-)Inflection
Lemma Tag Form
RUN V;PAST ran
RUN V;PRES;1;SG run
RUN V;PRES;2;SG run
RUN V;PRES;3;SG runs
RUN V;PRES;PL run
RUN V;PART running
Inflection: RUN + V;PST → ran
reinflection: running +V;PST → ran
Cotterell et al., 2016–2018
McCarthy et al., 2019
Vylomova et al., 2020
Pimentel, Ryskina et al, 2021
34. SIGMORPHON Shared Task on
Morphological (Re-)Inflection
Lemma Tag Form
RUN V;PAST ran
RUN V;PRES;1;SG run
RUN V;PRES;2;SG run
RUN V;PRES;3;SG runs
RUN V;PRES;PL run
RUN V;PART running
Inflection: RUN + V;PST → ran
reinflection: running +V;PST → ran
Approx. 96% avg. accuracy on
high-resource languages!
Significantly less in under-resourced
languages!
Winning systems are neural seq2seq
models
See more details in my SIGTYP Talk
35. Error Taxonomy (Gorman et al., 2019)
● Free variation error: more than one acceptable form exists
● Extraction errors: flaws in UniMorph’s parsing of Wiktionary
● Wiktionary errors: errors in Wiktionary/ ling. source data itself
● Silly errors: “bizarre” errors which defy any purely linguistic
characterization (``*membled'' instead of ``mailed'' or enters a
loop such as ``ynawemaylmyylmyylmyylmyylmyylmyym...''
instead of ``ysnewem'')
● Allomorphy errors: misapplication of existing allomorphic
patterns
● Spelling errors: forms that do not follow language-specific
orthographic conventions
37. Allomorphy Errors
● Stem-final vowels in Finnish (*pohjanpystykorvojen); Consonant gradation in Finnish (*ei kiemurda)
● Ablaut in Dutch and German (*pront; *saufte)
● Umlaut (*Einwohnerzähle, *Förmer), plural suffixes, Verbal prefixes in German (*umkehre)
● Linking vowels in Hungarian (*masszázsakból instead of *masszázsokból)
● Yers (*klęsek instead of klęsk), Genitive singular suffixes in Polish (*izotopa)
● Animacy in Polish and Russian (грузин vs. магазин in ACC.SG )
● Aspect in Russian (*будешь сорвать)
● Internal inflection in Russian compounds (*государствах-донорах, *лёгких промышленности
(ACC.PL))
38. Allomorphy Errors
● Stem-final vowels in Finnish (*pohjanpystykorvojen); Consonant gradation in Finnish (*ei kiemurda)
● Ablaut in Dutch and German (*pront; *saufte)
● Umlaut (*Einwohnerzähle, *Förmer), plural suffixes, Verbal prefixes in German (*umkehre)
● Linking vowels in Hungarian (*masszázsakból instead of *masszázsokból)
● Yers (*klęsek instead of klęsk), Genitive singular suffixes in Polish (*izotopa)
● Animacy in Polish and Russian (грузин vs. магазин in ACC.SG )
● Aspect in Russian (*будешь сорвать)
● Internal inflection in Russian compounds (*государствах-донорах, *лёгких промышленности
(ACC.PL))
Poor Performance on unseen lemmas
39. Some writing systems are more challenging!
Tibetan:
○ Nonce words and impossible combinations of component units (Di et al., 2019)
40. A Case Study on Nen (PNG; Muradoglu et al.,
2020)
● Spoken in the village of Bimadbn in the
Western Province of PNG, by approx
400 people
● Verbs: prefixing, middle, and ambifixing
● Distributed Exponence (DE);
“morphosyntactic feature values can
only be determined after unification of
multiple structural positions"
41. A Case Study on Nen (PNG; Muradoglu et al.,
2020)
● Low accuracy on small number of samples
(<1000)
● Allomorphy: vowel harmony
● Looping:
*ynawemaylmyylmyylmyylmy-ylmyylmyymaya
mawemyymamya
How well do they generalize?
Syncretism Test: all the TAM categories exhibit
syncretism across the second and third-person
singular actor. Exception: The past perfective slot
(where they take different forms)
Not observing the past perfective forms, systems
tend to predict the forms as syncretic
(generalizing from observed slots), resulting in the
misprediction of the actual forms (exceptions)
44. Experiment
data
15
25
35
22
Evenki :
● The dataset has been created from oral
speech samples
● Little attempt at any standardization in the
oral speech transcription
● The Evenki language is known to have rich
dialectal variation
● The vowel harmony is complex: not all
suffixes obey it, and it is also
dialect-dependent
Schema:
● Various past tense forms are all annotated as
PST
● Several comitative suffixes all annotated as
COM
● Some features are present in the word form
but they receive no annotation at all
Elena Klyachko
ST 2021
45. Experiment
data
15
25
35
22
Chukchi (Luorawetlat):
● The dataset has been created from oral
speech samples
● Very small and sparse dataset
● Polypersonal agreement
● very productive incorporation
English analogue: I am
bookreading/dishwashing/fruitcutting.
● Complex vowel harmony:
Morphemes: "plus" with [a], [o], [e], [ə] and
"minus" with [i], [e], [u], [ə]
If a word with minus morphemes gets an affix
with plus morphemes, all the minus ones turn
into plus: [i] -> [e], [e] -> [a], [u]--> [o].
Нутэнут + N;DAT → нотагты
Schema:
● No attempt to split genderlects
«walrus»: male рыркы (rərkə) — female
цыццы tsəttsə
Elena Klyachko, Maria Ryskina
ST 2021
46. Experiment
data
15
25
35
22
Itelmen:
● Very small and sparse dataset
● Verbs mark both subjects (with prefixes and
suffixes) and objects (with suffixes); hard to
capture combinations:
Gold Tag Predicted
čʼelisčiŋnen V;NO3S+AB3S;FIN;IND;PST čʼelnen
enčʼiɬivnen V;NO3S+AB3S;FIN;IND;PRS;CAUS ənčʼiɬiznen
nəntxlakninˀin V;NO3P+AB3S;FIN;FOC;IND;PST
nəntxlakisxenˀin
● Complex phono- and morphotactics
ktavolknan V.CVB;NFIN ktavolˀin
ktekejknen V.CVB;NFIN ktekejˀin
Karina Sheifer, Sofya Ganieva,
Matvey Plugaryov
ST 2021
47. Experiment
data
15
25
35
22
Kunwinjku
● Very small and sparse dataset
● Complex phonotactics:
borlbme+V;2;PL;Non-PST → *ngurriborlbme
(instead of ngurriborle)
● Falling into loops:
*ngarrrrrrrrrrrrrrmbbbijj (should be
karribelbmerrinj)
or *ngadjarridarrkddrrdddrrmerri (should be
karriyawoyhdjarrkbidyikarrmerrimeninj)
William Lane, Steven Bird
ST 2021
48. Experiment
data
15
25
35
22
Eibela
● Very small and sparse dataset (comes from inter-linear texts)
● Complex features, many of which didn’t have accurate
UniMorph analogues
ASS (associated place) → Approximative APPRX
ASS.EV (associated event) → Intransitive INTR
A.NOM (Agent nominalization) → Noun N
CONTINUOUS/PROG →Progressive PROG
COORD (Coordinator) → Comitative COM
DEL.IMP: DEL (Delayed imperative → Imperative-Jussive IMP ;
Future FUT
DEO.DUR (deontic.durative) →Obligative OBLIG ; durative DUR
D.S (different subject) →DS
HYPO (hypothetical) → Irrealis IRR
PERFECT/RESULTATIVE →Perfect PRF
PROHIB → IMP;NEG
PROGRESSIVE → PROG
SIM/SIM, S.S (simultaneous) → Simultaneous Multiclausal Aspect
SIMMA
ST.IMP (Strong imperative) → Imperative-Jussive Mood ; Remote
REMT
TEL.EV (Telic) → Telic TEL
UNCERT (Uncertain) → Potential POT
● Misprediction of vowel length:
to:mulu+N;ERG -->tomulε (should be of
to:mu:lε:)
Grant Aiton
ST 2021
49. Which language data can be
used for typological studies
(e.g., compare
morphological complexity)?
51. Approx.
amount of data
for each
ST2021
language
Polysynthetic languages
represented by a few samples
A few samples from glosses;
No full paradigms
Mainly full paradigms
Chukchi data contains:
Only 3 out 9 gram. cases;
3 out of 6 tenses
52. Approx.
amount of data
for each
ST2021
language
Polysynthetic languages
represented by a few samples
A few samples from glosses;
No full paradigms
Mainly full paradigms
Chukchi data contains:
Only 3 out 9 gram. cases;
3 out of 6 tenses
Check how representative the dataset is
before you use it for typological studies!
54. Paradigms in polysynthetic languages
Currently we have very limited representation of paradigms in
polysynthetic languages
Not clear what a potential paradigm should contain
55. Derivation – Inflection continuum
No strict boundary between inflection and derivation
Derivational morphology also represents systematicity and
regularity (English agentive “-er”, absence “-less”, ability
“-able”)
No clear dimensions of derivational meanings
56. Morphology and Syntax
Should we include clitics, copulas, MWEs?
If we do, the size of paradigm tables grows exponentially
What might be an efficient way to store them?