Documenting and Modeling Inflectional Paradigms in Under-resourced Languages
By Ekaterina Vylomova
evylomova@gmail.com
Cardamom Seminar Series
Feb 1, 2022
NLP: Universal or Euroversal?
Multilinguality
Approx. 7,000* languages in the world…
* Language vs. dialect distinction: “A language is a dialect with its own army and navy” (Max
Weinreich)
Roman Jakobson on differences
between Languages
``Languages differ essentially in what they
must convey and not in what they may
convey''
Chinese (Isolating; strict word order)
wǒmen xué le zhè xiē shēngcí.
I.PL.AN learn .PAST this .PL new word.
``We learned these new words.''
Russian (Synthetic; flexible word order)
My vyučili eti novyje slova.
We learn.PAST.PL this.ACC.PL new.ACC.PL word.ACC.PL
``We learned these new words.''
Nannu-n-niuti-kkuminar-tu-rujussu-u-vuq.
Polar.bear-catch-instrument.for.achieving-something.good.for-PART-big-be-3SG.INDIC
``It (a dog) is good for catching polar bears with.''
West Greenlandic (Polysynthetic; Fortescue (2017))
Speedtalk in “Gulf” by Robert Heinlein
Aban-yawoith-warrgah-marne-ganj-ginje-ng.
1/3PL-again-wrong-BEN-meat-cook-PP
``I cooked the wrong meat for them again''
Kunwinjku (Polysynthetic; Evans (2003))
Approx. 7,000* languages in the world…
… But the majority of NLP technologies focus only on the most documented
languages, such as English, French, German, Russian, Hindi, or
Finnish
Most NLP is Standard Average European
… According to Martin Haspelmath (2001), “euroversals” share:
● definite and indefinite articles (e.g. English the vs. a)
● a periphrastic perfect formed with 'have' plus a passive participle (e.g. English I
have said);
● the verb is inflected for person and number of the subject, but subject
pronouns may not be dropped even when this would be unambiguous
● Some features that are common in European languages but also found elsewhere:
○ lack of distinction between inclusive and exclusive first-person plural pronouns ("we
and you" vs. "we and not you")
○ lack of distinction between alienable and inalienable (e.g. body part) possession
SAE was introduced by Whorf, 1939
Most NLP is Standard Average European
Morphology: Position of Case Affixes (WALS 51A)
Suffixing morphology: Eurasian
and Australian languages;
Prefixation: Mesoamerican
languages and African languages
spoken below the Sahara
Most NLP is Standard Average European
Morphology: Inclusive/Exclusive Distinction in Independent Pronouns (WALS 39A)
No incl./excl. distinction: Eurasian
Incl./excl. distinction: Australian and South
American languages
Most NLP is Standard Average European
Morphology: Possessive Classification (WALS 59A)
No possessive classification: Eurasian
Possessive classification: Australian, African,
American
Most NLP is Standard Average European
Morphology: The Past Tense (WALS 66A)
Past tense, 1: Eurasian
No past tense: South-East Asia,
African, American;
UniMorph: Universal
Morphosyntactic annotation
Wiktionary: > 8000 language codes (incl. historic)
https://en.wiktionary.org/wiki/Wiktionary:List_of_languages
Wiktionary: language-specific
беглец (“male escapee/runaway”) + pos=N, case=ACC, number=SG → беглеца
Functionally similar feature names (cases, numbers, etc.) differ across languages!
The naming depends heavily on the descriptive tradition used to document and describe a language!
UniMorph: Universal Morphology
1) 23 dimensions of meaning (e.g. TAM, case, number, animacy), 212 features
2) A-morphous morphology (Anderson, 1992)
3) Paradigms extracted from English Wiktionary (Kirov et al., 2016)
https://unimorph.github.io/doc/unimorph-schema.pdf
By John Sylak-Glassman, 2015
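A minimal sketch of reading one UniMorph-style entry, assuming the commonly used tab-separated layout of lemma, inflected form, and a semicolon-separated feature bundle. The bundle `N;ACC;SG` is an assumed rendering of the Wiktionary-style features shown earlier (pos=N, case=ACC, number=SG); the names `Entry` and `parse_line` are hypothetical.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    lemma: str
    form: str
    features: frozenset  # unordered set of feature tags


def parse_line(line: str) -> Entry:
    """Parse one 'lemma<TAB>form<TAB>TAG;TAG;...' line."""
    lemma, form, bundle = line.rstrip("\n").split("\t")
    return Entry(lemma, form, frozenset(bundle.split(";")))


# Assumed example line, built from the Russian noun used earlier in the talk.
entry = parse_line("беглец\tбеглеца\tN;ACC;SG")
print(entry.lemma, entry.form, sorted(entry.features))
```

Storing the bundle as a set reflects the idea that the tags describe a paradigm cell rather than an ordered sequence of morphemes.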
UniMorph: Universal Morphology
A Russian noun declension:
Lemma Form UM Tags/Features
From Language-specific to
Universal features
Descriptive categories (specific to languages) vs. comparative concepts
(Haspelmath, 2010)
Gender
Noun class
Bantu languages: approx. 23 noun classes
Nakh-Daghestanian languages: 2–8 classes
● Masculine
● Feminine
● Neuter
Corbett, Greville G. 1991. Gender. Cambridge,
UK: Cambridge University Press.
Case Systems: Core
Case Systems: Non-Core
Semblative (like something, e.g. English “human-like”; Evenki) → COMPV
Abessive (English “-less”, “without”; Uralic) → PRIV
Causal case (“because of this…”; Moksha) → LGSPEC
Prepositional case (location, “about *”; Slavic) → ESS
Case Systems: Local
Uralic:
Inessive (“in”) → IN+ESS
Elative (“out of”) → IN+ABL
Illative (“into”) → IN+ALL
Allative (“onto”) → AT+ALL
Essive → FRML
Pl – Place; Dst – Distal; Mot – Motion; Asp – Aspect
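The case mappings listed on these slides can be sketched as a lookup table. This is only an illustration under stated assumptions: the dictionary contains just the labels named above, and the helper `to_unimorph` and its output layout (part of speech, then case, then any extra tags) are hypothetical.

```python
# Language-specific case label -> UniMorph tag(s), as listed on the slides.
CASE_TO_UNIMORPH = {
    "semblative": ["COMPV"],
    "abessive": ["PRIV"],
    "causal": ["LGSPEC"],
    "prepositional": ["ESS"],
    "inessive": ["IN", "ESS"],
    "elative": ["IN", "ABL"],
    "illative": ["IN", "ALL"],
    "allative": ["AT", "ALL"],
    "essive": ["FRML"],
}


def to_unimorph(pos: str, case_label: str, *extra: str) -> str:
    """Build a feature bundle; local cases are joined with '+'.

    Raises KeyError for labels that are not in the table.
    """
    case_tags = "+".join(CASE_TO_UNIMORPH[case_label.lower()])
    return ";".join([pos, case_tags, *extra])


print(to_unimorph("N", "Inessive", "SG"))
```

Keeping the mapping in data rather than in code makes it easier for language experts to review each label-to-tag decision.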
Case Systems: challenges in UniMorph
Cases (and other features) combinations:
Evenki: Accusative reflexive-genitive (ACC+PSSRS); Accusative definite (ACC+DEF) / indefinite (ACC+INDF)
Case compounding:
Uralic (e.g. Beserman Udmurt; from Maria Usacheva)
Case stacking:
Kayardild (up to 6 cases!)
Also in Chukchi, Sumerian, Basque, and others
Scripts and Writing Systems
A language may use different scripts over time
… and it’s often a political question!
1. E.g., the USSR created many writing systems for various indigenous peoples that didn’t have their own
2. Still, some republics have ongoing debates on which system they should use (e.g., Latin- vs. Cyrillic-based)
A language may use different scripts over time
Language | Arabic-based | Latin-based | Cyrillic-based | Other
Bashkir | < end of 1920s | 1930s | 1940– | -
Kazakh | < end of 1920s | 1930s | 1940– | -
Tatar | < end of 1920s | 1930s; 2000s | 1940– | -
Uzbek | < end of 1920s | 1930s; 1990– | 1940– | -
Uyghur | < end of 1920s; but still used in China | 1930–1940s | 1946– | -
Buryat | - | 1930s | End of 1930s– | <1920s: Mongolian
Chukchi | - | 1930s | 1937– | Tenevil
Udmurt | - | 1930s | end of 18th cent.– | -
Evenki | - | 1930s | 1937– | Mongolian (China)
One may consider all existing scripts and data,
but it is better to check with native speakers
which script they prefer!
SIGMORPHON Shared Task
on Morphological
Reinflection
SIGMORPHON Shared Task on
Morphological (Re-)Inflection
Lemma Tag Form
RUN V;PST ran
RUN V;PRES;1;SG run
RUN V;PRES;2;SG run
RUN V;PRES;3;SG runs
RUN V;PRES;PL run
RUN V;PART running
Inflection: RUN + V;PST → ran
Reinflection: running + V;PST → ran
Cotterell et al., 2016–2018
McCarthy et al., 2019
Vylomova et al., 2020
Pimentel, Ryskina et al, 2021
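The two task variants can be sketched with a toy lookup table. This is an illustration only: a real system must generalize to unseen lemmas, which a table cannot do. The tag strings follow the spelling used in the example lines above, and `inflect`, `reinflect`, and `encode` are hypothetical helper names. The `encode` function shows one possible way to present an input to a character-level seq2seq model (tag symbols followed by characters); it is an assumption, not the format of any particular submission.

```python
# Toy paradigm for the lemma "run"; tag strings are illustrative.
PARADIGM = {
    ("run", "V;PST"): "ran",
    ("run", "V;PRES;3;SG"): "runs",
    ("run", "V;PART"): "running",
}


def inflect(lemma: str, tag: str) -> str:
    """Inflection: lemma + target tag -> form."""
    return PARADIGM[(lemma, tag)]


def reinflect(source_form: str, target_tag: str) -> str:
    """Reinflection: any known inflected form + target tag -> form."""
    for (lemma, _), form in PARADIGM.items():
        if form == source_form:
            return inflect(lemma, target_tag)
    raise KeyError(f"unknown source form: {source_form}")


def encode(lemma: str, tag: str) -> list:
    """One possible seq2seq input: tag symbols, then the lemma's characters."""
    return tag.split(";") + list(lemma)


print(inflect("run", "V;PST"))
print(reinflect("running", "V;PST"))
print(encode("run", "V;PST"))
```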
Approx. 96% avg. accuracy on high-resource languages!
Significantly lower accuracy on under-resourced languages!
Winning systems are neural seq2seq models
See more details in my SIGTYP Talk
Error Taxonomy (Gorman et al., 2019)
● Free variation error: more than one acceptable form exists
● Extraction errors: flaws in UniMorph’s parsing of Wiktionary
● Wiktionary errors: errors in Wiktionary or the linguistic source data itself
● Silly errors: “bizarre” errors which defy any purely linguistic
characterization (``*membled'' instead of ``mailed'', or entering a
loop such as ``ynawemaylmyylmyylmyylmyylmyylmyym...''
instead of ``ysnewem'')
● Allomorphy errors: misapplication of existing allomorphic
patterns
● Spelling errors: forms that do not follow language-specific
orthographic conventions
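Looping outputs of the kind quoted under "silly errors" can be flagged automatically. The sketch below is a simple heuristic, not part of the cited taxonomy: it reports a prediction when some substring repeats back to back several times. The function name `has_loop` and the thresholds `min_unit` and `min_repeats` are assumptions chosen for illustration.

```python
def has_loop(form: str, min_unit: int = 2, min_repeats: int = 3) -> bool:
    """Return True if a substring of length >= min_unit occurs
    min_repeats or more times in immediate succession."""
    n = len(form)
    for unit in range(min_unit, n // min_repeats + 1):
        span = unit * min_repeats
        for start in range(0, n - span + 1):
            chunk = form[start:start + unit]
            if form[start:start + span] == chunk * min_repeats:
                return True
    return False


print(has_loop("ynawemaylmyylmyylmyylmyylmyylmyym"))  # looping prediction
print(has_loop("ysnewem"))                            # expected form
```

A filter like this only separates degenerate outputs from the rest; allomorphy or spelling errors still need linguistic inspection.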
The majority of errors are due to allomorphy
Allomorphy Errors
● Stem-final vowels in Finnish (*pohjanpystykorvojen); Consonant gradation in Finnish (*ei kiemurda)
● Ablaut in Dutch and German (*pront; *saufte)
● Umlaut (*Einwohnerzähle, *Förmer), plural suffixes, and verbal prefixes in German (*umkehre)
● Linking vowels in Hungarian (*masszázsakból instead of masszázsokból)
● Yers (*klęsek instead of klęsk) and genitive singular suffixes in Polish (*izotopa)
● Animacy in Polish and Russian (грузин vs. магазин in ACC.SG )
● Aspect in Russian (*будешь сорвать)
● Internal inflection in Russian compounds (*государствах-донорах, *лёгких промышленности
(ACC.PL))
Poor Performance on unseen lemmas
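To measure performance on unseen lemmas, the evaluation split has to keep every lemma entirely on one side. A minimal sketch, assuming rows are (lemma, tag, form) triples; the function name `split_by_lemma` and its parameters are hypothetical.

```python
import random


def split_by_lemma(rows, test_fraction=0.2, seed=0):
    """Split rows so that no lemma occurs in both train and test."""
    lemmas = sorted({lemma for lemma, _, _ in rows})
    rng = random.Random(seed)
    rng.shuffle(lemmas)
    n_test = max(1, int(len(lemmas) * test_fraction))
    test_lemmas = set(lemmas[:n_test])
    train = [row for row in rows if row[0] not in test_lemmas]
    test = [row for row in rows if row[0] in test_lemmas]
    return train, test


ROWS = [
    ("run", "V;PST", "ran"),
    ("run", "V;PRES;3;SG", "runs"),
    ("walk", "V;PST", "walked"),
    ("walk", "V;PRES;3;SG", "walks"),
    ("mail", "V;PST", "mailed"),
]
train, test = split_by_lemma(ROWS, test_fraction=0.34)
print(len(train), "training rows;", len(test), "test rows")
```

A random split over rows would instead let forms of the same lemma leak into both sets, which tends to overstate how well a model generalizes.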
Some writing systems are more challenging!
Tibetan:
○ Nonce words and impossible combinations of component units (Di et al., 2019)
A Case Study on Nen (PNG; Muradoglu et al.,
2020)
● Spoken in the village of Bimadbn in the
Western Province of PNG, by approx.
400 people
● Verbs: prefixing, middle, and ambifixing
● Distributed Exponence (DE):
“morphosyntactic feature values can
only be determined after unification of
multiple structural positions”
● Low accuracy with a small number of training samples
(<1000)
● Allomorphy: vowel harmony
● Looping:
*ynawemaylmyylmyylmyylmyylmyylmyymayamawemyymamya
How well do they generalize?
Syncretism test: all TAM categories exhibit
syncretism between the second- and third-person
singular actor forms. The exception is the past perfective slot,
where the two take different forms.
When the past perfective forms are not observed in training, systems
tend to predict them as syncretic
(generalizing from the observed slots), and so
mispredict the actual, exceptional forms.
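The syncretism test can be sketched as a check over paradigm cells. The data below is a toy stand-in: the forms `F1` to `F4` and the category labels `TAM1` and `TAM2` are placeholders, not real Nen forms, and `syncretic_slots` is a hypothetical helper.

```python
def syncretic_slots(paradigm, a="2SG", b="3SG"):
    """Split TAM categories by whether cells a and b share one form."""
    same, different = [], []
    for tam in sorted({tam for tam, _ in paradigm}):
        if paradigm[(tam, a)] == paradigm[(tam, b)]:
            same.append(tam)
        else:
            different.append(tam)
    return same, different


# Placeholder paradigm: syncretic everywhere except the past perfective.
TOY = {
    ("TAM1", "2SG"): "F1", ("TAM1", "3SG"): "F1",
    ("TAM2", "2SG"): "F2", ("TAM2", "3SG"): "F2",
    ("PST.PFV", "2SG"): "F3", ("PST.PFV", "3SG"): "F4",
}
same, different = syncretic_slots(TOY)
print("syncretic:", same)
print("distinct:", different)
```

Holding out the categories listed as distinct and checking whether a model predicts identical forms for both cells is one way to reproduce the test described above.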
Using machine learning to
evaluate our data quality
[Figure: experiment data; values shown: 15, 25, 35, 22]
What are the most “challenging”
languages?
● Syc: Syriac (train: 1217 rows/534
lemmas)
● Ckt: Chukchi (132 rows/118 lemmas)
● Itl: Itelmen (1246 rows/849 lemmas)
● Gup: Kunwinjku (214 rows/62
lemmas)
● Bra: Braj (1082 rows/825 lemmas)
● Ail: Eibela (918 rows/470 lemmas)
● Evn: Evenki (5216 rows/2919
lemmas)
● Sjo: Xibe (290 rows/206 lemmas)
ST 2021
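The sparsity of these datasets can be made concrete by dividing rows by lemmas. The numbers below are copied from the list above; the variable names are hypothetical. A ratio close to 1 means most lemmas are seen in only one inflected form.

```python
# (rows, lemmas) per language code, copied from the list above.
SIZES = {
    "syc": (1217, 534),
    "ckt": (132, 118),
    "itl": (1246, 849),
    "gup": (214, 62),
    "bra": (1082, 825),
    "ail": (918, 470),
    "evn": (5216, 2919),
    "sjo": (290, 206),
}

rows_per_lemma = {code: rows / lemmas for code, (rows, lemmas) in SIZES.items()}

# Print from sparsest to densest coverage per lemma.
for code, ratio in sorted(rows_per_lemma.items(), key=lambda item: item[1]):
    print(f"{code}: {ratio:.2f} rows per lemma")
```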
Evenki:
● The dataset has been created from oral
speech samples
● Little attempt at any standardization in the
oral speech transcription
● The Evenki language is known to have rich
dialectal variation
● The vowel harmony is complex: not all
suffixes obey it, and it is also
dialect-dependent
Schema:
● Various past tense forms are all annotated as
PST
● Several comitative suffixes all annotated as
COM
● Some features are present in the word form
but they receive no annotation at all
Elena Klyachko
Chukchi (Luorawetlat):
● The dataset has been created from oral
speech samples
● Very small and sparse dataset
● Polypersonal agreement
● Very productive incorporation
English analogue: I am
bookreading/dishwashing/fruitcutting.
● Complex vowel harmony:
Morphemes: "plus" with [a], [o], [e], [ə] and
"minus" with [i], [e], [u], [ə]
If a word with minus morphemes gets a plus affix,
all the minus vowels turn
into plus: [i] -> [e], [e] -> [a], [u] -> [o].
Нутэнут + N;DAT → нотагты
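The harmony rule above can be sketched as a simultaneous vowel substitution. Several things here are assumptions: the Latin transliteration, the segmentation of the example into a minus stem `nute` and a plus dative suffix `ɣtə`, and the explicit plus/minus flag on each morpheme (the flag is needed because [e] and [ə] occur in both sets, so the class cannot be read off the surface form). The name `harmonize` is hypothetical.

```python
# Minus -> plus vowel shifts from the slide: [i] -> [e], [e] -> [a], [u] -> [o].
# str.translate applies them in one pass, so an [i] that becomes [e]
# is not shifted again to [a].
MINUS_TO_PLUS = str.maketrans({"i": "e", "e": "a", "u": "o"})


def harmonize(morphemes):
    """morphemes: list of (segment, is_plus) pairs, in word order.

    If any morpheme is plus, every minus morpheme is shifted to plus vowels.
    """
    if not any(is_plus for _, is_plus in morphemes):
        return "".join(segment for segment, _ in morphemes)
    return "".join(
        segment if is_plus else segment.translate(MINUS_TO_PLUS)
        for segment, is_plus in morphemes
    )


# Assumed analysis of the dative example: minus stem + plus suffix.
print(harmonize([("nute", False), ("ɣtə", True)]))
```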
Schema:
● No attempt to split genderlects
«walrus»: male рыркы (rərkə) vs. female
цыццы (tsəttsə)
Elena Klyachko, Maria Ryskina
Itelmen:
● Very small and sparse dataset
● Verbs mark both subjects (with prefixes and
suffixes) and objects (with suffixes); hard to
capture combinations:
Gold Tag Predicted
čʼelisčiŋnen V;NO3S+AB3S;FIN;IND;PST čʼelnen
enčʼiɬivnen V;NO3S+AB3S;FIN;IND;PRS;CAUS ənčʼiɬiznen
nəntxlakninˀin V;NO3P+AB3S;FIN;FOC;IND;PST nəntxlakisxenˀin
● Complex phono- and morphotactics
ktavolknan V.CVB;NFIN ktavolˀin
ktekejknen V.CVB;NFIN ktekejˀin
Karina Sheifer, Sofya Ganieva,
Matvey Plugaryov
Kunwinjku
● Very small and sparse dataset
● Complex phonotactics:
borlbme+V;2;PL;Non-PST → *ngurriborlbme
(instead of ngurriborle)
● Falling into loops:
*ngarrrrrrrrrrrrrrmbbbijj (should be
karribelbmerrinj)
or *ngadjarridarrkddrrdddrrmerri (should be
karriyawoyhdjarrkbidyikarrmerrimeninj)
William Lane, Steven Bird
Eibela
● Very small and sparse dataset (comes from inter-linear texts)
● Complex features, many of which didn’t have accurate UniMorph analogues
ASS (associated place) → Approximative APPRX
ASS.EV (associated event) → Intransitive INTR
A.NOM (agent nominalization) → Noun N
CONTINUOUS/PROG → Progressive PROG
COORD (coordinator) → Comitative COM
DEL.IMP (delayed imperative) → Imperative-Jussive IMP; Future FUT
DEO.DUR (deontic durative) → Obligative OBLIG; Durative DUR
D.S (different subject) → DS
HYPO (hypothetical) → Irrealis IRR
PERFECT/RESULTATIVE → Perfect PRF
PROHIB → IMP;NEG
PROGRESSIVE → PROG
SIM/SIM, S.S (simultaneous) → Simultaneous Multiclausal Aspect SIMMA
ST.IMP (strong imperative) → Imperative-Jussive Mood; Remote REMT
TEL.EV (telic) → Telic TEL
UNCERT (uncertain) → Potential POT
● Misprediction of vowel length:
to:mulu+N;ERG → tomulε (should be to:mu:lε:)
Grant Aiton
Which language data can be
used for typological studies
(e.g., compare
morphological complexity)?
[Figure: approx. amount of data for each ST 2021 language]
Polysynthetic languages
represented by a few samples
A few samples from glosses;
No full paradigms
Mainly full paradigms
Chukchi data contains:
Only 3 out of 9 grammatical cases;
3 out of 6 tenses
Check how representative the dataset is
before you use it for typological studies!
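A representativeness check of this kind can be sketched as a comparison between the tags attested in a dataset and the inventory a grammar describes. Everything in the example data is a placeholder: the six-case inventory, the lemma and form strings, and the helper name `feature_coverage` are hypothetical and do not describe Chukchi or any other real dataset.

```python
def feature_coverage(rows, inventory):
    """Return (attested, missing) members of `inventory`.

    rows: iterable of (lemma, form, 'TAG;TAG;...') triples.
    """
    observed = set()
    for _, _, bundle in rows:
        observed.update(bundle.split(";"))
    attested = sorted(inventory & observed)
    missing = sorted(inventory - observed)
    return attested, missing


CASES = {"NOM", "ACC", "DAT", "GEN", "INS", "LOC"}  # hypothetical inventory
ROWS_SAMPLE = [
    ("lemma1", "form1", "N;NOM;SG"),
    ("lemma1", "form2", "N;DAT;SG"),
    ("lemma2", "form3", "N;NOM;PL"),
]
attested, missing = feature_coverage(ROWS_SAMPLE, CASES)
print(f"{len(attested)} of {len(CASES)} cases attested; missing: {missing}")
```

The expected inventory has to come from a descriptive grammar or a language expert; the dataset alone cannot reveal which categories it lacks.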
Current challenges and
open questions
Paradigms in polysynthetic languages
Currently we have very limited representation of paradigms in
polysynthetic languages
Not clear what a potential paradigm should contain
Derivation – Inflection continuum
No strict boundary between inflection and derivation
Derivational morphology also represents systematicity and
regularity (English agentive “-er”, absence “-less”, ability
“-able”)
No clear dimensions of derivational meanings
Morphology and Syntax
Should we include clitics, copulas, MWEs?
If we do, the size of paradigm tables grows exponentially
What might be an efficient way to store them?
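The growth in table size can be illustrated with simple arithmetic: a full paradigm table has one cell per combination of feature values, so every added dimension multiplies the cell count. The dimensions and their sizes below are hypothetical, as is the helper name `paradigm_cells`.

```python
from math import prod


def paradigm_cells(dimensions):
    """Number of cells in a full cross-product paradigm table."""
    return prod(len(values) for values in dimensions.values())


# Hypothetical dimensions for a verb paradigm.
base = {
    "tense": ["PST", "PRS", "FUT"],
    "person": ["1", "2", "3"],
    "number": ["SG", "PL"],
}
# Adding two binary dimensions (e.g. for clitics) quadruples the table.
extended = {**base, "polarity": ["POS", "NEG"], "clitic": ["none", "Q"]}

print(paradigm_cells(base), "cells without clitics")
print(paradigm_cells(extended), "cells with two extra binary dimensions")
```

One storage option, offered only as a suggestion, is a sparse mapping from feature bundles to forms that records attested cells and leaves the rest of the cross-product implicit.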
Thank you!
Join and follow us:
https://twitter.com/unimorph_/

Unveiling The Crucial Role Of Cobalt In PlantHimanshu Pandey
 
Ostiguy & Panizza & Moffitt (eds.) - Populism in Global Perspective. A Perfor...
Ostiguy & Panizza & Moffitt (eds.) - Populism in Global Perspective. A Perfor...Ostiguy & Panizza & Moffitt (eds.) - Populism in Global Perspective. A Perfor...
Ostiguy & Panizza & Moffitt (eds.) - Populism in Global Perspective. A Perfor...frank0071
 
Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...
Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...
Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...Sérgio Sacani
 
GBSN - Microbiology Lab 1 (Microbiology Lab Safety Procedures)
GBSN -  Microbiology Lab  1 (Microbiology Lab Safety Procedures)GBSN -  Microbiology Lab  1 (Microbiology Lab Safety Procedures)
GBSN - Microbiology Lab 1 (Microbiology Lab Safety Procedures)Areesha Ahmad
 
Structural annotation................pptx
Structural annotation................pptxStructural annotation................pptx
Structural annotation................pptxCherry
 
Tissue engineering......................pptx
Tissue engineering......................pptxTissue engineering......................pptx
Tissue engineering......................pptxCherry
 
INSIGHT Partner Profile: Tampere University
INSIGHT Partner Profile: Tampere UniversityINSIGHT Partner Profile: Tampere University
INSIGHT Partner Profile: Tampere UniversitySteffi Friedrichs
 
Hemoglobin metabolism: C Kalyan & E. Muralinath
Hemoglobin metabolism: C Kalyan & E. MuralinathHemoglobin metabolism: C Kalyan & E. Muralinath
Hemoglobin metabolism: C Kalyan & E. Muralinathmuralinath2
 
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...Health Advances
 

Recently uploaded (20)

electrochemical gas sensors and their uses.pptx
electrochemical gas sensors and their uses.pptxelectrochemical gas sensors and their uses.pptx
electrochemical gas sensors and their uses.pptx
 
Emergent ribozyme behaviors in oxychlorine brines indicate a unique niche for...
Emergent ribozyme behaviors in oxychlorine brines indicate a unique niche for...Emergent ribozyme behaviors in oxychlorine brines indicate a unique niche for...
Emergent ribozyme behaviors in oxychlorine brines indicate a unique niche for...
 
Mitosis...............................pptx
Mitosis...............................pptxMitosis...............................pptx
Mitosis...............................pptx
 
METHODS OF TRANSCRIPTOME ANALYSIS....pptx
METHODS OF TRANSCRIPTOME ANALYSIS....pptxMETHODS OF TRANSCRIPTOME ANALYSIS....pptx
METHODS OF TRANSCRIPTOME ANALYSIS....pptx
 
Cell Immobilization Methods and Applications.pptx
Cell Immobilization Methods and Applications.pptxCell Immobilization Methods and Applications.pptx
Cell Immobilization Methods and Applications.pptx
 
GBSN - Biochemistry (Unit 4) Chemistry of Carbohydrates
GBSN - Biochemistry (Unit 4) Chemistry of CarbohydratesGBSN - Biochemistry (Unit 4) Chemistry of Carbohydrates
GBSN - Biochemistry (Unit 4) Chemistry of Carbohydrates
 
Pests of sugarcane_Binomics_IPM_Dr.UPR.pdf
Pests of sugarcane_Binomics_IPM_Dr.UPR.pdfPests of sugarcane_Binomics_IPM_Dr.UPR.pdf
Pests of sugarcane_Binomics_IPM_Dr.UPR.pdf
 
A Giant Impact Origin for the First Subduction on Earth
A Giant Impact Origin for the First Subduction on EarthA Giant Impact Origin for the First Subduction on Earth
A Giant Impact Origin for the First Subduction on Earth
 
Aerodynamics. flippatterncn5tm5ttnj6nmnynyppt
Aerodynamics. flippatterncn5tm5ttnj6nmnynypptAerodynamics. flippatterncn5tm5ttnj6nmnynyppt
Aerodynamics. flippatterncn5tm5ttnj6nmnynyppt
 
Phytogeography........................pptx
Phytogeography........................pptxPhytogeography........................pptx
Phytogeography........................pptx
 
Detectability of Solar Panels as a Technosignature
Detectability of Solar Panels as a TechnosignatureDetectability of Solar Panels as a Technosignature
Detectability of Solar Panels as a Technosignature
 
Unveiling The Crucial Role Of Cobalt In Plant
Unveiling The Crucial Role Of Cobalt In PlantUnveiling The Crucial Role Of Cobalt In Plant
Unveiling The Crucial Role Of Cobalt In Plant
 
Ostiguy & Panizza & Moffitt (eds.) - Populism in Global Perspective. A Perfor...
Ostiguy & Panizza & Moffitt (eds.) - Populism in Global Perspective. A Perfor...Ostiguy & Panizza & Moffitt (eds.) - Populism in Global Perspective. A Perfor...
Ostiguy & Panizza & Moffitt (eds.) - Populism in Global Perspective. A Perfor...
 
Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...
Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...
Gliese 12 b, a temperate Earth-sized planet at 12 parsecs discovered with TES...
 
GBSN - Microbiology Lab 1 (Microbiology Lab Safety Procedures)
GBSN -  Microbiology Lab  1 (Microbiology Lab Safety Procedures)GBSN -  Microbiology Lab  1 (Microbiology Lab Safety Procedures)
GBSN - Microbiology Lab 1 (Microbiology Lab Safety Procedures)
 
Structural annotation................pptx
Structural annotation................pptxStructural annotation................pptx
Structural annotation................pptx
 
Tissue engineering......................pptx
Tissue engineering......................pptxTissue engineering......................pptx
Tissue engineering......................pptx
 
INSIGHT Partner Profile: Tampere University
INSIGHT Partner Profile: Tampere UniversityINSIGHT Partner Profile: Tampere University
INSIGHT Partner Profile: Tampere University
 
Hemoglobin metabolism: C Kalyan & E. Muralinath
Hemoglobin metabolism: C Kalyan & E. MuralinathHemoglobin metabolism: C Kalyan & E. Muralinath
Hemoglobin metabolism: C Kalyan & E. Muralinath
 
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
 

Documenting and modeling inflectional paradigms in under-resourced languages

  • 1. Documenting and Modeling Inflectional Paradigms in Under-resourced Languages By Ekaterina Vylomova evylomova@gmail.com Cardamom Seminar Series Feb 1, 2022
  • 2. NLP: Universal or Euroversal?
  • 4. Approx. 7,000* languages in the world… * Language vs. dialect distinction: “A language is a dialect with its own army and navy” (Max Weinreich)
  • 5. Roman Jakobson on differences between Languages ``Languages differ essentially in what they must convey and not in what they may convey''
  • 6. Chinese (Isolating; strict word order) wǒmen xué le zhè xiē shēngcí. I.PL.AN learn .PAST this .PL new word. ``We learned these new words.''
  • 7. Russian (Synthetic; flexible word order) My vyučili eti novyje slova. We learn.PAST.PL this.ACC.PL new.ACC.PL word.ACC.PL ``We learned these new words.''
  • 8. Nannu-n-niuti-kkuminar-tu-rujussu-u-vuq. Polar.bear-catch-instrument.for.achieving-something.good.for-PART- big-be-3SG.INDIC ``It (a dog) is good for catching polar bears with.'' Speedtalk in “Gulf” by Robert Heinlein West Greenlandic (Polysynthetic; Fortescue (2017))
  • 9. Aban-yawoith-warrgah-marne-ganj-ginje-ng. 1/3PL-again-wrong-BEN-meat-cook-PP ``I cooked the wrong meat for them again'' Kunwinjku (Polysynthetic; Evans (2003))
  • 10. Approx. 7,000* languages in the world… … But the majority of NLP technologies focus only on the best-documented languages, such as English, French, German, Russian, Hindi, or Finnish * Language vs. dialect distinction: “A language is a dialect with its own army and navy” (Max Weinreich)
  • 11. Most NLP is Standard Average European … According to Martin Haspelmath (2001), “euroversals” share: ● definite and indefinite articles (e.g. English the vs. a) ● a periphrastic perfect formed with 'have' plus a passive participle (e.g. English I have said); ● the verb is inflected for person and number of the subject, but subject pronouns may not be dropped even when this would be unambiguous ● Some features that are common in European langs but also found elsewhere: ○ lack of distinction between inclusive and exclusive first-person plural pronouns ("we and you" vs. "we and not you") ○ lack of distinction between alienable and inalienable (e.g. body part) possession; SAE was introduced by Whorf, 1939
  • 12. Most NLP is Standard Average European Morphology: Position of Case Affixes (WALS 51A) SAE was introduced by Whorf, 1939 Suffixing morphology: Eurasian and Australian languages; Prefixation: Mesoamerican languages and African languages spoken below the Sahara
  • 13. Most NLP is Standard Average European Morphology: Inclusive/Exclusive Distinction in Independent Pronouns (WALS 39A) SAE was introduced by Whorf, 1939 No inclusive/exclusive distinction: Eurasian languages; inclusive/exclusive distinction: Australian and South American languages
  • 14. Most NLP is Standard Average European Morphology: Possessive Classification (WALS 59A) SAE was introduced by Whorf, 1939 No possessive classification: Eurasian; possessive classification: Australian, African, American
  • 15. Most NLP is Standard Average European Morphology: The Past Tense (WALS 66A) SAE was introduced by Whorf, 1939 Past tense, 1: Eurasian; No past tense: South-East Asia, African, American
  • 17. Wiktionary: > 8000 language codes (incl. historic) https://en.wiktionary.org/wiki/Wiktionary:List_of_languages
  • 18. Wiktionary: language-specific беглец (“male escapee/runaway”) + pos=N, case=ACC, number=SG → беглеца Functionally similar feature names (cases, numbers, etc.) differ across languages! They depend heavily on the descriptive tradition used to document/describe a language!
  • 19. UniMorph: Universal Morphology 1) 23 dimensions of meaning (TAM, case, number, animacy), 212 features 2) A-morphous morphology (Anderson, 1992) 3) Paradigms extracted from English Wiktionary (Kirov et al., 2016) https://unimorph.github.io/doc/unimorph-schema.pdf By John Sylak-Glassman, 2015
  • 20. UniMorph: Universal Morphology https://unimorph.github.io/doc/unimorph-schema.pdf A Russian noun declension: Lemma Form UM Tags/Features
  • 21. From Language-specific to Universal features Descriptive categories (specific to languages) vs. comparative concepts (Haspelmath, 2010)
  • 22. Gender Noun class Bantu Languages: approx. 23 noun classes Nakh-Daghestanian Languages: 2–8 classes ● Masculine ● Feminine ● Neuter Corbett, Greville G. 1991. Gender. Cambridge, UK: Cambridge University Press.
  • 23. Gender Noun class Bantu Languages: approx. 23 noun classes ● Masculine ● Feminine ● Neuter Corbett, Greville G. 1991. Gender. Cambridge, UK: Cambridge University Press.
  • 25. Case Systems: Non-Core Semblative (like smth, e.g. English “human-like”; Evenki) → COMPV Abessive (English “-less”, “without” ; Uralic) → PRIV Causal case (“because of this…”; Moksha) → LGSPEC Prepositional case (location, “about *”; Slavic) → ESS XXXX
  • 26. Case Systems: Local Uralic: Inessive (“in”) → IN+ESS; Elative (“out of”) → IN+ABL; Illative (“into”) → IN+ALL; Allative (“onto”) → AT+ALL; Essive → FRML. Pl – Place; Dst – Distal; Mot – Motion; Asp – Aspect
  • 27. Case Systems: challenges in UniMorph Cases (and other features) combinations: Evenki: Accusative reflexive-genitive (ACC+PSSRS); Accusative Definite (ACC+DEF)/Indefinite (ACC+INDF) Case compounding: Uralic(e.g. Beserman Udmurt; from Maria Usacheva) Case stacking: Kayardild (up to 6 cases!): Also in Chukchi, Sumerian, Basque, and others
  • 29. A language may use different scripts over time … and it’s often a political question! 1. E.g., USSR invented many writing systems for various indigenous peoples that didn’t have their own system 2. Still, some republics have ongoing debates on the system they should use (e.g., Latin- vs. Cyrillic-based)
  • 30. A language may use different scripts over time
      Language | Arabic | Latin | Cyrillic | Other
      Bashkir | < end of 1920s | 1930s | 1940– |
      Kazakh | < end of 1920s | 1930s | 1940– |
      Tatar | < end of 1920s | 1930s; 2000s | 1940– |
      Uzbek | < end of 1920s | 1930s; 1990– | 1940– |
      Uyghur | < end of 1920s; but still used in China | 1930–1940s | 1946– |
      Buryat | | 1930s | End of 1930s– | <1920s: Mongolian
      Chukchi | | 1930s | 1937– | Tenevil
      Udmurt | | 1930s | end of 18th cent.– |
      Evenki | | 1930s | 1937– | Mongolian (China)
  • 31. A language may use different scripts over time
      Language | Arabic-based | Latin-based | Cyrillic-based | Other
      Bashkir | < end of 1920s | 1930s | 1940– |
      Kazakh | < end of 1920s | 1930s | 1940– |
      Tatar | < end of 1920s | 1930s; 2000s | 1940– |
      Uzbek | < end of 1920s | 1930s; 1990– | 1940– |
      Uyghur | < end of 1920s; but still used in China | 1930–1940s | 1946– |
      Buryat | | 1930s | End of 1930s– | <1920s: Mongolian
      Chukchi | | 1930s | 1937– | Tenevil
      Udmurt | | 1930s | end of 18th cent.– |
      Evenki | | 1930s | 1937– | Mongolian (China)
      One may consider all existing scripts and data, but it is better to check with native speakers which script they prefer!
  • 32. SIGMORPHON Shared Task on Morphological Reinflection
  • 33. SIGMORPHON Shared Task on Morphological (Re-)Inflection
      Lemma | Tag | Form
      RUN | V;PAST | ran
      RUN | V;PRES;1;SG | run
      RUN | V;PRES;2;SG | run
      RUN | V;PRES;3;SG | runs
      RUN | V;PRES;PL | run
      RUN | V;PART | running
      Inflection: RUN + V;PST → ran
      Reinflection: running + V;PST → ran
      Cotterell et al., 2016–2018; McCarthy et al., 2019; Vylomova et al., 2020; Pimentel, Ryskina et al., 2021
  • 34. SIGMORPHON Shared Task on Morphological (Re-)Inflection
      Lemma | Tag | Form
      RUN | V;PAST | ran
      RUN | V;PRES;1;SG | run
      RUN | V;PRES;2;SG | run
      RUN | V;PRES;3;SG | runs
      RUN | V;PRES;PL | run
      RUN | V;PART | running
      Inflection: RUN + V;PST → ran
      Reinflection: running + V;PST → ran
      Approx. 96% avg. accuracy on high-resource languages! Significantly less in under-resourced languages! Winning systems are neural seq2seq models. See more details in my SIGTYP Talk
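The data format and the accuracy figure on these slides can be illustrated with a small sketch. It assumes tab-separated lemma/form/tags lines with `;`-separated tags, and that accuracy means exact string match; the names `Triple`, `parse_triple` and `exact_match_accuracy` are hypothetical and not part of any shared-task tooling.

```python
from typing import FrozenSet, List, NamedTuple


class Triple(NamedTuple):
    lemma: str
    form: str
    tags: FrozenSet[str]


def parse_triple(line: str) -> Triple:
    """Parse one 'lemma<TAB>form<TAB>TAG;TAG' line (assumed layout)."""
    lemma, form, tags = line.rstrip("\n").split("\t")
    return Triple(lemma, form, frozenset(tags.split(";")))


def exact_match_accuracy(gold: List[str], predicted: List[str]) -> float:
    """Share of predictions that equal the gold form character for character."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


rows = ["run\tran\tV;PST", "run\truns\tV;PRS;3;SG"]
triples = [parse_triple(r) for r in rows]
gold_forms = [t.form for t in triples]
# A system that over-regularizes the past tense gets one of two forms right.
print(exact_match_accuracy(gold_forms, ["runned", "runs"]))
```

Storing tags as a set reflects that the order of features inside a tag bundle carries no meaning.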
  • 35. Error Taxonomy (Gorman et al., 2019) ● Free variation error: more than one acceptable form exists ● Extraction errors: flaws in UniMorph’s parsing of Wiktionary ● Wiktionary errors: errors in Wiktionary/ ling. source data itself ● Silly errors: “bizarre” errors which defy any purely linguistic characterization (``*membled'' instead of ``mailed'' or enters a loop such as ``ynawemaylmyylmyylmyylmyylmyylmyym...'' instead of ``ysnewem'') ● Allomorphy errors: misapplication of existing allomorphic patterns ● Spelling errors: forms that do not follow language-specific orthographic conventions
  • 36. Error Taxonomy (Gorman et al., 2019) Majority of errors are due to allomorphy
  • 37. Allomorphy Errors ● Stem-final vowels in Finnish (*pohjanpystykorvojen); Consonant gradation in Finnish (*ei kiemurda) ● Ablaut in Dutch and German (*pront; *saufte) ● Umlaut (*Einwohnerzähle, *Förmer), plural suffixes, Verbal prefixes in German (*umkehre) ● Linking vowels in Hungarian (*masszázsakból instead of masszázsokból) ● Yers (*klęsek instead of klęsk), Genitive singular suffixes in Polish (*izotopa) ● Animacy in Polish and Russian (грузин vs. магазин in ACC.SG) ● Aspect in Russian (*будешь сорвать) ● Internal inflection in Russian compounds (*государствах-донорах, *лёгких промышленности (ACC.PL))
  • 38. Allomorphy Errors ● Stem-final vowels in Finnish (*pohjanpystykorvojen); Consonant gradation in Finnish (*ei kiemurda) ● Ablaut in Dutch and German (*pront; *saufte) ● Umlaut (*Einwohnerzähle, *Förmer), plural suffixes, Verbal prefixes in German (*umkehre) ● Linking vowels in Hungarian (*masszázsakból instead of masszázsokból) ● Yers (*klęsek instead of klęsk), Genitive singular suffixes in Polish (*izotopa) ● Animacy in Polish and Russian (грузин vs. магазин in ACC.SG) ● Aspect in Russian (*будешь сорвать) ● Internal inflection in Russian compounds (*государствах-донорах, *лёгких промышленности (ACC.PL)) Poor performance on unseen lemmas
  • 39. Some writing systems are more challenging! Tibetan: ○ Nonce words and impossible combinations of component units (Di et al., 2019)
  • 40. A Case Study on Nen (PNG; Muradoglu et al., 2020) ● Spoken in the village of Bimadbn in the Western Province of PNG, by approx. 400 people ● Verbs: prefixing, middle, and ambifixing ● Distributed Exponence (DE): “morphosyntactic feature values can only be determined after unification of multiple structural positions”
  • 41. A Case Study on Nen (PNG; Muradoglu et al., 2020) ● Low accuracy on small number of samples (<1000) ● Allomorphy: vowel harmony ● Looping: *ynawemaylmyylmyylmyylmy-ylmyylmyymaya mawemyymamya How well do they generalize? Syncretism Test: all the TAM categories exhibit syncretism across the second and third-person singular actor. Exception: The past perfective slot (where they take different forms) Not observing the past perfective forms, systems tend to predict the forms as syncretic (generalizing from observed slots), resulting in the misprediction of the actual forms (exceptions)
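The looping outputs mentioned above can be flagged automatically. What follows is a minimal heuristic sketch, not a method from the talk: it treats any unit of at least two characters repeated three or more times in a row as a loop. The function name `has_loop` and the thresholds are assumptions, and genuine reduplication in a language could trigger false positives.

```python
import re

# A unit of 2+ characters followed by at least two more copies of itself.
_LOOP = re.compile(r"(.{2,}?)\1{2,}")


def has_loop(form: str) -> bool:
    """Return True if the predicted form contains a degenerate repetition."""
    return _LOOP.search(form) is not None


for prediction in ["ynawemaylmyylmyylmyylmyylmyylmyym", "ysnewem"]:
    print(prediction, has_loop(prediction))
```

Such a filter only detects the symptom; it says nothing about why the decoder failed to stop.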
  • 42. Using machine learning to evaluate our data quality
  • 43. Experiment data 15 25 35 22 What are the most “challenging” languages? ● Syc: Syriac (train: 1217 rows/534 lemmas) ● Ckt: Chukchi (132 rows/118 lemmas) ● Itl: Itelmen (1246 rows/849 lemmas) ● Gup: Kunwinjku (214 rows/62 lemmas) ● Bra: Braj (1082 rows/825 lemmas) ● Ail: Eibela (918 rows/470 lemmas) ● Evn: Evenki (5216 rows/2919 lemmas) ● Sjo: Xibe (290 rows/206 lemmas) ST 2021
  • 44. Experiment data 15 25 35 22 Evenki : ● The dataset has been created from oral speech samples ● Little attempt at any standardization in the oral speech transcription ● The Evenki language is known to have rich dialectal variation ● The vowel harmony is complex: not all suffixes obey it, and it is also dialect-dependent Schema: ● Various past tense forms are all annotated as PST ● Several comitative suffixes all annotated as COM ● Some features are present in the word form but they receive no annotation at all Elena Klyachko ST 2021
  • 45. Experiment data 15 25 35 22 Chukchi (Luorawetlat): ● The dataset has been created from oral speech samples ● Very small and sparse dataset ● Polypersonal agreement ● Very productive incorporation. English analogue: I am bookreading/dishwashing/fruitcutting. ● Complex vowel harmony: morphemes are either "plus" (with [a], [o], [e], [ə]) or "minus" (with [i], [e], [u], [ə]). If a word with minus morphemes gets an affix with plus morphemes, all the minus ones turn into plus: [i] -> [e], [e] -> [a], [u] -> [o]. Нутэнут + N;DAT → нотагты Schema: ● No attempt to split genderlects: «walrus» is male рыркы (rərkə) vs. female цыццы (tsəttsə) Elena Klyachko, Maria Ryskina ST 2021
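The vowel harmony rule on this slide can be sketched as a character mapping. This is a deliberately simplified illustration: it works on Cyrillic orthography, assumes the stem нутэ- and the suffix -гты for the slide's example, requires the caller to say whether the suffix is a "plus" morpheme (schwa occurs in both classes, so the class cannot be read off the vowels), and only changes the stem. The names `MINUS_TO_PLUS` and `attach` are hypothetical.

```python
# Minus vowels and their plus counterparts: [i] -> [e], [e] -> [a], [u] -> [o].
# str.translate maps each character once, so the new [e] is not lowered again.
MINUS_TO_PLUS = str.maketrans({"и": "э", "э": "а", "у": "о"})


def attach(stem: str, suffix: str, suffix_is_plus: bool) -> str:
    """Attach a suffix, shifting minus vowels in the stem if the suffix is plus."""
    if suffix_is_plus:
        stem = stem.translate(MINUS_TO_PLUS)
    return stem + suffix


print(attach("нутэ", "гты", suffix_is_plus=True))
```

A neural model has to infer the lexical plus/minus class from very few examples, which is one reason small datasets are hard here.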
  • 46. Experiment data 15 25 35 22 Itelmen: ● Very small and sparse dataset ● Verbs mark both subjects (with prefixes and suffixes) and objects (with suffixes); hard to capture combinations: Gold Tag Predicted čʼelisčiŋnen V;NO3S+AB3S;FIN;IND;PST čʼelnen enčʼiɬivnen V;NO3S+AB3S;FIN;IND;PRS;CAUS ənčʼiɬiznen nəntxlakninˀin V;NO3P+AB3S;FIN;FOC;IND;PST nəntxlakisxenˀin ● Complex phono- and morphotactics ktavolknan V.CVB;NFIN ktavolˀin ktekejknen V.CVB;NFIN ktekejˀin Karina Sheifer, Sofya Ganieva, Matvey Plugaryov ST 2021
  • 47. Experiment data 15 25 35 22 Kunwinjku ● Very small and sparse dataset ● Complex phonotactics: borlbme+V;2;PL;Non-PST → *ngurriborlbme (instead of ngurriborle) ● Falling into loops: *ngarrrrrrrrrrrrrrmbbbijj (should be karribelbmerrinj) or *ngadjarridarrkddrrdddrrmerri (should be karriyawoyhdjarrkbidyikarrmerrimeninj) William Lane, Steven Bird ST 2021
  • 48. Experiment data 15 25 35 22 Eibela ● Very small and sparse dataset (comes from inter-linear texts) ● Complex features, many of which didn’t have accurate UniMorph analogues:
      ASS (associated place) → Approximative APPRX
      ASS.EV (associated event) → Intransitive INTR
      A.NOM (Agent nominalization) → Noun N
      CONTINUOUS/PROG → Progressive PROG
      COORD (Coordinator) → Comitative COM
      DEL.IMP: DEL (Delayed imperative) → Imperative-Jussive IMP; Future FUT
      DEO.DUR (deontic.durative) → Obligative OBLIG; Durative DUR
      D.S (different subject) → DS
      HYPO (hypothetical) → Irrealis IRR
      PERFECT/RESULTATIVE → Perfect PRF
      PROHIB → IMP;NEG
      PROGRESSIVE → PROG
      SIM/SIM, S.S (simultaneous) → Simultaneous Multiclausal Aspect SIMMA
      ST.IMP (Strong imperative) → Imperative-Jussive Mood; Remote REMT
      TEL.EV (Telic) → Telic TEL
      UNCERT (Uncertain) → Potential POT
    ● Misprediction of vowel length: to:mulu + N;ERG → tomulε (should be to:mu:lε:) Grant Aiton ST 2021
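A label conversion like the one listed on this slide can be kept as a plain lookup table. The sketch below copies a handful of the mappings from the slide; the function name `convert` and the choice to return unmapped labels separately, rather than guessing a tag, are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Subset of the source-gloss -> UniMorph mappings shown on the slide.
EIBELA_TO_UNIMORPH: Dict[str, Tuple[str, ...]] = {
    "HYPO": ("IRR",),
    "PROHIB": ("IMP", "NEG"),
    "DEO.DUR": ("OBLIG", "DUR"),
    "D.S": ("DS",),
    "TEL.EV": ("TEL",),
    "UNCERT": ("POT",),
}


def convert(labels: List[str]) -> Tuple[str, List[str]]:
    """Map gloss labels to a ';'-joined tag string plus a list of unmapped labels."""
    tags: List[str] = []
    unmapped: List[str] = []
    for label in labels:
        if label in EIBELA_TO_UNIMORPH:
            tags.extend(EIBELA_TO_UNIMORPH[label])
        else:
            unmapped.append(label)
    return ";".join(tags), unmapped


print(convert(["PROHIB", "D.S", "SOMETHING.ELSE"]))
```

Returning the unmapped labels makes lossy conversions visible instead of silently dropping features.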
  • 49. Which language data can be used for typological studies (e.g., compare morphological complexity)?
  • 50. Approx. amount of data for each ST2021 language
  • 51. Approx. amount of data for each ST2021 language Polysynthetic languages represented by a few samples A few samples from glosses; No full paradigms Mainly full paradigms Chukchi data contains: Only 3 out of 9 grammatical cases; 3 out of 6 tenses
  • 52. Approx. amount of data for each ST2021 language Polysynthetic languages represented by a few samples A few samples from glosses; No full paradigms Mainly full paradigms Chukchi data contains: Only 3 out of 9 grammatical cases; 3 out of 6 tenses Check how representative the dataset is before you use it for typological studies!
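The representativeness check recommended here can be approximated by comparing the feature values attested in a dataset against the inventory described in a grammar. This is a minimal sketch; the function name `coverage` and the placeholder labels `C1`..`C9` are hypothetical stand-ins and are not real Chukchi case names.

```python
from typing import Iterable, List, Tuple


def coverage(attested: Iterable[str], inventory: Iterable[str]) -> Tuple[float, List[str]]:
    """Return the share of the inventory that is attested, and the missing values."""
    inv = set(inventory)
    if not inv:
        raise ValueError("inventory must not be empty")
    seen = inv & set(attested)
    return len(seen) / len(inv), sorted(inv - seen)


# Placeholder inventory of nine cases, three of which occur in the data.
case_inventory = [f"C{i}" for i in range(1, 10)]
ratio, missing = coverage(["C1", "C2", "C5", "C2"], case_inventory)
print(f"{ratio:.2f}", missing)
```

A low ratio suggests the dataset reflects what happened to occur in the source texts rather than the full paradigm.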
  • 54. Paradigms in polysynthetic languages Currently we have very limited representation of paradigms in polysynthetic languages Not clear what a potential paradigm should contain
  • 55. Derivation – Inflection continuum No strict boundary between inflection and derivation Derivational morphology also represents systematicity and regularity (English agentive “-er”, absence “-less”, ability “-able”) No clear dimensions of derivational meanings
  • 56. Morphology and Syntax Should we include clitics, copulas, MWEs? If we do, the size of paradigm tables grows exponentially What might be an efficient way to store them?
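The growth in table size can be made concrete with a small calculation. This sketch assumes a fully crossed paradigm, where the number of cells is the product of the number of values per dimension; the dimension names and counts are hypothetical, and real paradigms usually have gaps and syncretism that reduce the count.

```python
import math
from typing import Dict


def paradigm_cells(dimensions: Dict[str, int]) -> int:
    """Number of cells in a fully crossed paradigm table."""
    return math.prod(dimensions.values())


base = {"person": 3, "number": 2, "tense": 3}
print(paradigm_cells(base))

# Each optional clitic slot doubles the table (present vs. absent).
with_clitics = dict(base, clitic_a=2, clitic_b=2, clitic_c=2, clitic_d=2)
print(paradigm_cells(with_clitics))
```

One possible answer to the storage question is to keep the per-dimension values and generate cells on demand rather than materializing the full table.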
  • 57. Thank you! Join and follow us: https://twitter.com/unimorph_/