Documenting and modeling inflectional paradigms in under-resourced languages

Documenting and Modeling
Inﬂectional Paradigms in
Under-resourced Languages
By Ekaterina Vylomova
evylomova@gmail.com
Cardamom Seminar Series
Feb, 1 2022

Approx. 7,000* languages in the world…
* Language vs. dialect distinction: “A language is a dialect with its own army and navy” (Max
Weinreich)

Roman Jakobson on diﬀerences
between Languages
``Languages diﬀer essentially in what they
must convey and not in what they may
convey''

Chinese (Isolating; strict word order)
wǒmen xué le zhè xiē shēngcí.
I.PL.AN learn .PAST this .PL new word.
``We learned these new words.''

Russian (Synthetic;ﬂexible word order)
My vyučili eti novyje slova.
We learn.PAST.PL this.ACC.PL new.ACC.PL word.ACC.PL
``We learned these new words.''

Nannu-n-niuti-kkuminar-tu-rujussu-u-vuq.
Polar.bear-catch-instrument.for.achieving-something.good.for-PART-
big-be-3SG.INDIC
``It (a dog) is good for catching polar bears with.''
Speedtalk in “Gulf” by Robert Heinlein
West Greenlandic (Polysynthetic; Fortescue
(2017))

Aban-yawoith-warrgah-marne-ganj-ginje-ng.
1/3PL-again-wrong-BEN-meat-cook-PP
``I cooked the wrong meat for them again''
Kunwinjku (Polysynthetic; Evans (2003))

Approx. 7,000* languages in the world…
… But majority of NLP technologies only focus on most documented
languages such as English, French, German, Russian, Hindi, or
Finnish
* Language vs. dialect distinction: “A language is a dialect with its own army and navy” (Max
Weinreich)

Most NLP is Standard Average European
… According to Martin Haspelmath (2001), “euroversals” share:
● definite and indefinite articles (e.g. English the vs. a)
● a periphrastic perfect formed with 'have' plus a passive participle (e.g. English I
have said);
● the verb is inflected for person and number of the subject, but subject
pronouns may not be dropped even when this would be unambiguous
● Some features that are common in European langs but also found elsewhere:
○ lack of distinction between inclusive and exclusive first-person plural pronouns ("we
and you" vs. "we and not you")
○ lack of distinction between alienable and inalienable (e.g. body part) possession;
SAE was introduced by Whorf, 1939

Morphology: Position of Case Affixes (WALS 51A)
Suffixing morphology: Eurasian
and Australian languages;
Prefixation: Mesoamerican
languages and African languages
spoken below the Sahara

Morphology: Inclusive/Exclusive Distinction in Independent Pronouns (WALS 39A)
No Incl.excl.: Eurasian
Incl/excl.: Australian and South
American languages and;

Morphology: Possessive Classiﬁcation (WALS 59A)
No Possessive: Eurasian
Possessive: Australian, African,
American;

Morphology: The Past Tense (WALS 66A)
Past tense, 1: Eurasian
No past tense: South-East Asia,
African, American;

UniMorph: Universal
Morphosyntactic annotation

Wiktionary: > 8000 language codes (incl.
historic)
https://en.wiktionary.org/wiki/Wiktionary:List_of_languages

Wiktionary: language-speciﬁc
беглец (“male escapee/runaway”) + pos=N, case=ACC, number=SG → беглеца
Functionally similar feature
names (cases, numbers, etc )
diﬀer across languages!
Highly depends on the
descriptive tradition used to
document/describe a
language!

UniMorph: Universal Morphology
1) 23 dimensions of meaning (TAM, case, number, animacy), 212
features
2) A-morphous morphology (Anderson, 1992)
3) Paradigms extracted from English Wiktionary (Kirov et al., 2016)
https://unimorph.github.io/doc/unimorph-schema.pdf
By John Sylak-Glassman, 2015

UniMorph: Universal Morphology
https://unimorph.github.io/doc/unimorph-schema.pdf
A Russian noun declension:
Lemma Form UM Tags/Features

From Language-speciﬁc to
Universal features
Descriptive categories (speciﬁc to languages) vs. comparative concepts
(Haspelmath, 2010)

Gender
Noun class Bantu Languages: approx.
23 noun classes
Nakh-Daghestanian
Languages: 2–8 classes
● Masculine
● Feminine
● Neuter
Corbett, Greville G. 1991. Gender. Cambridge,
UK: Cambridge University Press.

Gender
Noun class
Bantu Languages: approx.
23 noun classes
● Masculine
● Feminine
● Neuter
Corbett, Greville G. 1991. Gender. Cambridge,
UK: Cambridge University Press.

Case Systems:
Non-Core
Semblative (like smth, e.g.
English “human-like”;
Evenki) → COMPV
Abessive (English “-less”,
“without” ; Uralic) → PRIV
Causal case (“because of
this…”; Moksha) → LGSPEC
Prepositional case
(location, “about *”; Slavic)
→ ESS
XXXX

Case Systems:
Local
Uralic:
Inessive (“in”) → IN+ESS
Elative (“out of”) → IN+ABL
Illative (“into”) → IN +ALL
Allative (“onto”) → AT+ALL
Essive → FRML
Pl– Place; Dst – Distal; Mot – Motion; Asp –
Aspect

Case Systems: challenges in UniMorph
Cases (and other features) combinations:
Evenki: Accusative reflexive-genitive (ACC+PSSRS); Accusative Definite (ACC+DEF)/Indefinite
(ACC+INDF)
Case compounding:
Uralic(e.g. Beserman Udmurt;
from Maria Usacheva)
Case stacking:
Kayardild (up to 6 cases!):
Also in Chukchi, Sumerian, Basque, and others

A language may use diﬀerent scripts over
time
… and it’s often a political question!
1. E.g., USSR invented many writing
systems for various indigenous
peoples that didn’t have their own
system
2. Still, some republics have ongoing
debates on the system they
should use (e.g., Latin- vs.
Cyrillic-based)

timeLanguage Arabic Latin Cyrillic Other
Bashkir < end of 1920s 1930s 1940–
Kazakh < end of 1920s 1930s 1940–
Tatar < end of 1920s 1930s;2000s 1940–
Uzbek < end of 1920s 1930s; 1990– 1940–
Uyghur < end of 1920s; but still
used in China
1930–1940s 1946–
Buryat 1930s End of 1930s– <1920s: Mongolian
Chukchi 1930s 1937– Tenevil
Udmurt 1930s end of 18th cent. —
Evenki 1930s 1937– Mongolian (China)

timeLanguage Arabic-based Latin-based Cyrillic-based Other
Bashkir < end of 1920s 1930s 1940–
Kazakh < end of 1920s 1930s 1940–
Tatar < end of 1920s 1930s;2000s 1940–
Uzbek < end of 1920s 1930s; 1990– 1940–
Uyghur < end of 1920s; but still
used in China
1930–1940s 1946–
Buryat 1930s End of 1930s– <1920s: Mongolian
Chukchi 1930s 1937– Tenevil
Udmurt 1930s end of 18th cent. —
Evenki 1930s 1937– Mongolian (China)
May consider all existing scripts and data
but better to check with native speakers
which one would be preferred !

SIGMORPHON Shared Task
on Morphological
Reinﬂection

SIGMORPHON Shared Task on
Morphological (Re-)Inflection
Lemma Tag Form
RUN V;PAST ran
RUN V;PRES;1;SG run
RUN V;PRES;2;SG run
RUN V;PRES;3;SG runs
RUN V;PRES;PL run
RUN V;PART running
Inflection: RUN + V;PST → ran
reinflection: running +V;PST → ran
Cotterell et al., 2016–2018
McCarthy et al., 2019
Vylomova et al., 2020
Pimentel, Ryskina et al, 2021

SIGMORPHON Shared Task on
Morphological (Re-)Inflection
Lemma Tag Form
RUN V;PAST ran
RUN V;PRES;1;SG run
RUN V;PRES;2;SG run
RUN V;PRES;3;SG runs
RUN V;PRES;PL run
RUN V;PART running
Inflection: RUN + V;PST → ran
reinflection: running +V;PST → ran
Approx. 96% avg. accuracy on
high-resource languages!
Significantly less in under-resourced
languages!
Winning systems are neural seq2seq
models
See more details in my SIGTYP Talk

Error Taxonomy (Gorman et al., 2019)
● Free variation error: more than one acceptable form exists
● Extraction errors: ﬂaws in UniMorph’s parsing of Wiktionary
● Wiktionary errors: errors in Wiktionary/ ling. source data itself
● Silly errors: “bizarre” errors which defy any purely linguistic
characterization (``*membled'' instead of ``mailed'' or enters a
loop such as ``ynawemaylmyylmyylmyylmyylmyylmyym...''
instead of ``ysnewem'')
● Allomorphy errors: misapplication of existing allomorphic
patterns
● Spelling errors: forms that do not follow language-speciﬁc
orthographic conventions

Error Taxonomy (Gorman et al., 2019)
Majority of errors are due to allomorphy

Allomorphy Errors
● Stem-final vowels in Finnish (*pohjanpystykorvojen); Consonant gradation in Finnish (*ei kiemurda)
● Ablaut in Dutch and German (*pront; *saufte)
● Umlaut (*Einwohnerzähle, *Förmer), plural suffixes, Verbal prefixes in German (*umkehre)
● Linking vowels in Hungarian (*masszázsakból instead of *masszázsokból)
● Yers (*klęsek instead of klęsk), Genitive singular suffixes in Polish (*izotopa)
● Animacy in Polish and Russian (грузин vs. магазин in ACC.SG )
● Aspect in Russian (*будешь сорвать)
● Internal inflection in Russian compounds (*государствах-донорах, *лёгких промышленности
(ACC.PL))

Allomorphy Errors
● Stem-final vowels in Finnish (*pohjanpystykorvojen); Consonant gradation in Finnish (*ei kiemurda)
● Ablaut in Dutch and German (*pront; *saufte)
● Umlaut (*Einwohnerzähle, *Förmer), plural suffixes, Verbal prefixes in German (*umkehre)
● Linking vowels in Hungarian (*masszázsakból instead of *masszázsokból)
● Yers (*klęsek instead of klęsk), Genitive singular suffixes in Polish (*izotopa)
● Animacy in Polish and Russian (грузин vs. магазин in ACC.SG )
● Aspect in Russian (*будешь сорвать)
● Internal inflection in Russian compounds (*государствах-донорах, *лёгких промышленности
(ACC.PL))
Poor Performance on unseen lemmas

Some writing systems are more challenging!
Tibetan:
○ Nonce words and impossible combinations of component units (Di et al., 2019)

A Case Study on Nen (PNG; Muradoglu et al.,
2020)
● Spoken in the village of Bimadbn in the
Western Province of PNG, by approx
400 people
● Verbs: prefixing, middle, and ambifixing
● Distributed Exponence (DE);
“morphosyntactic feature values can
only be determined after unification of
multiple structural positions"

A Case Study on Nen (PNG; Muradoglu et al.,
2020)
● Low accuracy on small number of samples
(<1000)
● Allomorphy: vowel harmony
● Looping:
*ynawemaylmyylmyylmyylmy-ylmyylmyymaya
mawemyymamya
How well do they generalize?
Syncretism Test: all the TAM categories exhibit
syncretism across the second and third-person
singular actor. Exception: The past perfective slot
(where they take diﬀerent forms)
Not observing the past perfective forms, systems
tend to predict the forms as syncretic
(generalizing from observed slots), resulting in the
misprediction of the actual forms (exceptions)

Using machine learning to
evaluate our data quality

Experiment
data
15
25
35
22
What are the most “challenging”
languages?
● Syc: Syriac (train: 1217 rows/534
lemmas)
● Ckt: Chukchi (132 rows/118 lemmas)
● Itl: Itelmen (1246 rows/849 lemmas)
● Gup: Kunwinjku (214 rows/62
lemmas)
● Bra: Braj (1082 rows/825 lemmas)
● Ail: Eibela (918 rows/470 lemmas)
● Evn: Evenki (5216 rows/2919
lemmas)
● Sjo: Xibe (290 rows/206 lemmas)
ST 2021

Experiment
data
15
25
35
22
Evenki :
● The dataset has been created from oral
speech samples
● Little attempt at any standardization in the
oral speech transcription
● The Evenki language is known to have rich
dialectal variation
● The vowel harmony is complex: not all
suﬃxes obey it, and it is also
dialect-dependent
Schema:
● Various past tense forms are all annotated as
PST
● Several comitative suﬃxes all annotated as
COM
● Some features are present in the word form
but they receive no annotation at all
Elena Klyachko
ST 2021

Experiment
data
15
25
35
22
Chukchi (Luorawetlat):
● The dataset has been created from oral
speech samples
● Very small and sparse dataset
● Polypersonal agreement
● very productive incorporation
English analogue: I am
bookreading/dishwashing/fruitcutting.
● Complex vowel harmony:
Morphemes: "plus" with [a], [o], [e], [ə] and
"minus" with [i], [e], [u], [ə]
If a word with minus morphemes gets an aﬃx
with plus morphemes, all the minus ones turn
into plus: [i] -> [e], [e] -> [a], [u]--> [o].
Нутэнут + N;DAT → нотагты
Schema:
● No attempt to split genderlects
«walrus»: male рыркы (rərkə) — female
цыццы tsəttsə
Elena Klyachko, Maria Ryskina
ST 2021

Experiment
data
15
25
35
22
Itelmen:
● Verbs mark both subjects (with prefixes and
suffixes) and objects (with suffixes); hard to
capture combinations:
Gold Tag Predicted
čʼelisčiŋnen V;NO3S+AB3S;FIN;IND;PST čʼelnen
enčʼiɬivnen V;NO3S+AB3S;FIN;IND;PRS;CAUS ənčʼiɬiznen
nəntxlakninˀin V;NO3P+AB3S;FIN;FOC;IND;PST
nəntxlakisxenˀin
● Complex phono- and morphotactics
ktavolknan V.CVB;NFIN ktavolˀin
ktekejknen V.CVB;NFIN ktekejˀin
Karina Sheifer, Sofya Ganieva,
Matvey Plugaryov
ST 2021

Experiment
data
15
25
35
22
Kunwinjku
● Complex phonotactics:
borlbme+V;2;PL;Non-PST → *ngurriborlbme
(instead of ngurriborle)
● Falling into loops:
*ngarrrrrrrrrrrrrrmbbbijj (should be
karribelbmerrinj)
or *ngadjarridarrkddrrdddrrmerri (should be
karriyawoyhdjarrkbidyikarrmerrimeninj)
William Lane, Steven Bird
ST 2021

Experiment
data
15
25
35
22
Eibela
● Very small and sparse dataset (comes from inter-linear texts)
● Complex features, many of which didn’t have accurate
UniMorph analogues
ASS (associated place) → Approximative APPRX
ASS.EV (associated event) → Intransitive INTR
A.NOM (Agent nominalization) → Noun N
CONTINUOUS/PROG →Progressive PROG
COORD (Coordinator) → Comitative COM
DEL.IMP: DEL (Delayed imperative → Imperative-Jussive IMP ;
Future FUT
DEO.DUR (deontic.durative) →Obligative OBLIG ; durative DUR
D.S (diﬀerent subject) →DS
HYPO (hypothetical) → Irrealis IRR
PERFECT/RESULTATIVE →Perfect PRF
PROHIB → IMP;NEG
PROGRESSIVE → PROG
SIM/SIM, S.S (simultaneous) → Simultaneous Multiclausal Aspect
SIMMA
ST.IMP (Strong imperative) → Imperative-Jussive Mood ; Remote
REMT
TEL.EV (Telic) → Telic TEL
UNCERT (Uncertain) → Potential POT
● Misprediction of vowel length:
to:mulu+N;ERG -->tomulε (should be of
to:mu:lε:)
Grant Aiton
ST 2021

Which language data can be
used for typological studies
(e.g., compare
morphological complexity)?

Approx.
amount of data
for each
ST2021
language

Approx.
amount of data
for each
ST2021
language
Polysynthetic languages
represented by a few samples
A few samples from glosses;
No full paradigms
Mainly full paradigms
Chukchi data contains:
Only 3 out 9 gram. cases;
3 out of 6 tenses

Approx.
amount of data
for each
ST2021
language
Polysynthetic languages
represented by a few samples
A few samples from glosses;
No full paradigms
Mainly full paradigms
Chukchi data contains:
Only 3 out 9 gram. cases;
3 out of 6 tenses
Check how representative the dataset is
before you use it for typological studies!

Current challenges and
open questions

Paradigms in polysynthetic languages
Currently we have very limited representation of paradigms in
polysynthetic languages
Not clear what a potential paradigm should contain

Derivation – Inﬂection continuum
No strict boundary between inﬂection and derivation
Derivational morphology also represents systematicity and
regularity (English agentive “-er”, absence “-less”, ability
“-able”)
No clear dimensions of derivational meanings

Morphology and Syntax
Should we include clitics, copulas, MWEs?
If we do, the size of paradigm tables grows exponentially
What might be an eﬃcient way to store them?

Thank you!
Join and follow us:
https://twitter.com/unimorph_/

Documenting and modeling inflectional paradigms in under-resourced languages

Recommended

Recommended

More Related Content

What's hot

What's hot (18)

Similar to Documenting and modeling inflectional paradigms in under-resourced languages

Similar to Documenting and modeling inflectional paradigms in under-resourced languages (20)

More from Katerina Vylomova

More from Katerina Vylomova (15)

Recently uploaded

Recently uploaded (20)

Documenting and modeling inflectional paradigms in under-resourced languages