SlideShare a Scribd company logo
1 of 79
Download to read offline
Language Technologies
What endangered languages can teach us
Dafydd Gibbon
U Bielefeld
ELKL-4, Agra, 2016-02-
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 2/79
Focus: role reversal
Not just:
What can the Human Language Technologies
offer to endangered (and other) languages?
But:
What can Human Language Engineering learn from endangered
languages?
Or:
What do endangered languages require from the Human
Language Technologies?
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 3/79
Applying technologies – a reminder
A traditional
‘waterfall model’ of
software development
Requirements
specification:
systems analysis
Product dissemination:
Reliability, sustainability,
interoperability, updating
Implementation:
Programming
methods, styles
Design:
Modules, algorithms,
data structures
Evaluation:
Black box, glass box,
field testing
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 4/79
Applying technologies – a reminder
Requirements
specification:
systems analysis
Product dissemination:
Reliability, sustainability,
interoperability, updating
Implementation:
Programming
methods, styles
Design:
Modules, algorithms,
data structures
Evaluation:
Black box, glass box,
field testing
Feedback and
revision cycle
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 5/79
Summary
● Reversed roles:
– What can endangered languages teach lang tech?
● Background to Language Endangerment
● Technology roles:
– Documentary technologies
– Enabling technologies
– Productivity technologies
● Enabling technologies
– Annotation and annotation mining
● How do languages differ in speech rhythm?
– How similar / different are languages?
● Language differences/distances among languages of Ivory Coast
● Endangered languages:
– What do science and technology (human knowledge in general)
lose when languages die?
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 6/79
Language Technologies
● Engineering:
– speech: ASR, TTS, speaker id / recognition, ...
– language: NLP, NL parsing, Q&A, text mining, text
classification, lexicon and grammar induction, machine
translation …
– multimodal: speech I/0 (dictation, process control, speech
computer UI), speech avatars (Siri, Cortana), gesture
(touchpad, waving), biometric systems
● Computational linguistics & Computer Science:
– domain models of natural language syntax, semantics,
pragmatics, language typology and genesis
– formalisms and algorithms for induction, parsing,
generation of language
– corpus analysis for lexicon and grammar induction
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 7/79
Language Technologies
● Engineering:
– speech: ASR, TTS, speaker id / recognition, ...
– language: NLP, NL parsing, Q&A, text mining, text
classification, lexicon and grammar induction, machine
translation …
– multimodal: speech I/0 (dictation, process control, speech
computer UI), speech avatars (Siri, Cortana), gesture
(touchpad, waving), biometric systems
● Computational linguistics & Computer Science:
– domain models of natural language syntax, semantics,
pragmatics, language typology and genesis
– formalisms and algorithms for induction, parsing,
generation of language
– corpus analysis for lexicon and grammar induction
OVERLAP
(also with linguistics)
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 8/79
The broader interdisciplinary context
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 9/79
Doc Ling and Digital Humanities
Humanities:
Theology, Philology, Literary
Studies, Linguistics, Law, ...Computation
Computational Linguistics
Mathematical
Computational
Linguistics
NLP, Computational
Corpus Linguistics
Digital Humanities
Speech Technologies
Documentary
Linguistics
?
Timeline
1950
Humanities
Computing
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 10/79
Doc Ling and Digital Humanities
Humanities:
Theology, Philology, Literary
Studies, Linguistics, Law, ...Computation
Computational Linguistics
Mathematical
Computational
Linguistics
NLP, Computational
Corpus Linguistics
Digital Humanities
Speech Technologies
Documentary
Linguistics
?
Timeline
1950
Humanities
ComputingSGML
TEI
(HTML)
XML
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 11/79
Roles for computational technologies
1. Documentation technologies
2. Enabling technologies
3. Productivity technologies
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 12/79
1. Documentation Technologies
● Project planning tools
● Data collection tools
– scenario support for
● elicitation
● recording (multimodal)
● metadata collection
– document scanning
● Data archiving and access
– database and search model
● relational, object-oriented, ...
● Multilinear annotation
– for search, re-use analysis, application:
● sharable (sustainable, interoperable) standards
● annotation categories for phonetics, grammar, discourse, …
● semi-automatic annotation methods
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 13/79
2. Enabling Technologies
● Resource construction tools
● phonetic analysis
● lexicon induction from data
– word lists
– word frequency lists (and other word statistics)
– concordances
– collocations
● grammar induction from data
– Part of Speech (POS) tagging
– grammar induction
– parsing and generation
● translation
– multilingual dictionaries
– terminologies
– processing of parallel or comparable texts
– translator’s workbench
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 14/79
3. Productivity Technologies
● Recognition
– Automatic Speech Recognition (ASR)
– Visual scene and object recognition
– Information retrieval from text
● Identification
– Speaker identification
– Language identfication
– Authorship attribution
● Generation
– Text-to-Speech Synthesis (TTS)
– Written text geneneration from databases
● Products
– Dictation and information applications
– Translation applications
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 15/79
Endangered languages as teachers: some topics
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 16/79
Endangered languages as teachers: some topics
Insights into and results concerning
– diversity vs. normalization of speech and language
– the intricacy, complexity of speech, language and languages
– similarities and differences between languages
(typology, history, dispersion of language)
– scenario-dependent properties of languages
– gender, age, social role
– task orientation
– public vs. informal vs. intimate styles
– diversity of expression of emotion
– political status of languages wrt dominance, minorities
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 17/79
Endangered languages as teachers: some outlets
● Speech Assessment Methodologies (EC Project)
● EAGLES: Expert Advisory Groups for Language
Engineering Standards
● Many other resources oriented European Projects,
including
– MATE
– IMDI
– ...
● LREC – Language Resources and Evaluation Conference
● Language Resources Map
– https://en.wikipedia.org/wiki/LRE_Map
● Krauwer’s BLARK: Basic Language Resource Kit
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 18/79
Endangered languages as teachers: some models
● Steven Krauwer’s BLARK:
– Goal of equal status of European languages
– Generalisable to the world at large?
– Basic Language Resource Kit initial specification
● written language corpora
● spoken language corpora
● mono- and bilingual dictionaries
● terminology collections
● grammars
● modules (e.g. taggers, morphological analysers, parsers, speech
recognisers, text-to-speech)
● annotation standards and tools
● corpus exploration and exploitation tools
● bilingual corpora
● etc
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 19/79
Some models: Basic Language Resource Kit (BLARK)
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 20/79
Some models: Basic Language Resource Kit (BLARK)
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 21/79
Many more topics …
● A closer look at language typology reveals many more
requirements for the technogies, for example:
– concurrent data and parallel processing:
● phonological tone
● morphological tone vs. pitch accent vs. stress; prosodic negation
● grammatical stress/accent, focus, intonation, word order
● gesture, sign language, discourse participants
– language similarity for priority setting in language selection
for funding (usually: chance) and for system adaptation:
● role of language typology:
– similarity/difference in speech sound systems
– similarity/difference in grammar
– similarity/difference in the lexicon
● similarity/difference in discourse conventions
● similarity/difference in general cultural conventions
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 22/79
Comparison of DocLing and LangTech scenarios
● The documentary linguistic scenarios are:
– rather individual, extremely heterogeneous
– rather hard to define and delimit
– de facto standards: Praat, ELAN, Wordsmith, TypeCraft,...
– somewhat ad hoc - ‘what you can get’
● Language, speech, multimodal technology scenarios are:
– highly standardised, rather coherent
– tendentially easy to define and delimit
– very application / product oriented
● especially in speech technology: highly product specific
● text technology is more generic
– regulated standards:
● statistical evaluation procedures
● institutional standards (e.g. ISO)
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 23/79
Relations between the technologies: a simplified ‘workflow’
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 24/79
Reversed roles: what endangered languages offer to engineering
Endangered
(and other)
language
scenarios
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 25/79
Reversed roles: what endangered languages offer to engineering
Endangered
(and other)
language
scenarios
Formal and
computational
models
Interview,
corpus,
experimental
phonetic and
psycholinguistic
methods
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 26/79
Reversed roles: what endangered languages offer to engineering
Modern
software
supported
descriptive
and
experimental
fieldwork
Automatic
analysis &
machine
learning
Endangered
(and other)
language
scenarios
Formal and
computational
models
Interview,
corpus,
experimental
phonetic and
psycholinguistic
methods
Lexicons
Grammars
Text
grammars
Discourse
models
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 27/79
Reversed roles: what endangered languages offer to engineering
Modern
software
supported
descriptive
and
experimental
fieldwork
Automatic
analysis &
machine
learning
Endangered
(and other)
language
scenarios
Formal and
computational
models
Interview,
corpus,
experimental
phonetic and
psycholinguistic
methods
Lexicons
Grammars
Text
grammars
Discourse
models
Repositories
MPI (DoBeS)
SOAS (HRELP)
PARADISEC
LDC
ELRA/ELDA
… … ...
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 28/79
Reversed roles: what endangered languages offer to engineering
Applications
Literature
Media
Teaching
HLT
Modern
software
supported
descriptive
and
experimental
fieldwork
Automatic
analysis &
machine
learning
Endangered
(and other)
language
scenarios
Formal and
computational
models
Interview,
corpus,
experimental
phonetic and
psycholinguistic
methods
Lexicons
Grammars
Text
grammars
Discourse
models
Repositories
MPI (DoBeS)
SOAS (HRELP)
PARADISEC
LDC
ELRA/ELDA
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 29/79
What I will be looking at – a modest selection
Modern
software
supported
descriptive
and
experimental
fieldwork
Automatic
analysis &
machine
learning
Endangered
(and other)
language
scenarios
Formal and
computational
models
Interview,
corpus,
experimental
phonetic and
psycholinguistic
methods
Prosodic
Analysis
Tone
Intonation
Timing
Rhythm
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 30/79
What I will be looking at – a modest selection
Modern
software
supported
descriptive
and
experimental
fieldwork
Automatic
analysis &
machine
learning
Endangered
(and other)
language
scenarios
Formal and
computational
models
Interview,
corpus,
experimental
phonetic and
psycholinguistic
methods
Prosodic
Analysis
Tone
Intonation
Timing
Rhythm
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 31/79
Background to Language Endangerment:
A Global Perspective
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 32/79
Ethnologue metadata repository of all known languages
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 33/79
Background: Global Perspective on language endangerment
● Globalization
– travel, trade, economy, politics
→ language dominance
Trade languages, pidgins:
Greek → Latin → Arabic → Portuguese → Spanish → French → English
→ religion, culture
– crises
● Example: climate deterioration (Africa, West Asia)
→ lack of water → famine → poverty → hunger → disease
→ migration → diaspora
● “Clash of Civilizations” (Samuel Huntington)
→ conflicts: both local and global
● The role of language stereotypes, discrimination?
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 34/79
Background: Global Perspective on language endangerment
● Globalization
– travel, trade, economy, politics
→ language dominance
Trade languages, pidgins:
Greek → Latin → Arabic → Portuguese → Spanish → French → English
→ religion, culture
– crises
● Example: climate deterioration (Africa, West Asia)
→ lack of water → famine → poverty → hunger → disease
→ migration
● “Clash of Civilizations” (Samuel Huntington)
→ conflicts: both local and global
● The role of language stereotypes, discrimination?
The ‘explanation’ for language endangerment as
LOSS OF INTERGENERATIONAL
TRANSMISSON OF LANGUAGES, CULTURE,
RELIGION, SKILLS, …
is too simple.
It begs the question:
Why is there loss of intergenerational
transmission?
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 35/79
Clash of Civilizations? World religions
How far is
technology
relevant?
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 36/79
Clash of Civilizations? World language families
How far can
technology go?
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 37/79
ISO 639-3 Language Code Standard – 5 language types
● Living languages: A language is listed as living when there are people still
living who learned it as a first language. Based on intelligibility not politics.
● Extinct languages: no longer living, gone extinct in recent times. (e.g. in
the last few centuries), distinct languages identified based on intelligibility
(as defined for individual languages).
● Ancient languages: If it went extinct in ancient times (> 1000 years), must
have had a distinct literature and be treated distinctly by the scholarly
community, must have an attested literature or be well-documented as a
language known to have been spoken by some particular community at
some point in history, may not be areconstructed language inferred from
historical-comparative analysis.
● Historic languages: Must be distinct from any modern languages that are
descended from it; for instance, Old English and Middle English. Here, too,
the criterion is that the language have a literature that is treated distinctly
by the scholarly community.
● Constructed languages: Must have a literature and it must be designed
for the purpose of human communication. Specifically excluded are
reconstructed languages and computer programming languages.
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 38/79
UNESCO chart of endangered languages
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 39/79
UNESCO map of endangered languages
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 40/79
Endangered languages - ‘in danger of disappearing’
● Endangered languages:
– Languages in danger of disappearing
– UNESCO
– The language birth and death cycle
● Why and how do new languages appear / disappear?
● Why ‘danger’?
– The economics of language endangerment
– The politics of language endangerment
– Social consequences of language endangerment
● What can the Human Language Technologies do?
– Documentation Technologies?
– Enabling Technologies?
– Productivity Technologies?
● Which languages can be handled by HLT?
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 41/79
Enabling Technologies
Annotation and Annotation Mining
Language Similarity Analysis
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 42/79
Enabling technologies
● Annotation (preferably automatic)
– associating text labels and time-stamps with speech recordings
● Annotation mining (preferably automatic)
– information extraction from annotations:
● text label list, text label frequencies, text label duration statistics
● visualisation of text label duration patterns, rhythm patterns
● Similarity analysis – virtual distance mapping
– Which languages have been (almost) documented?
– Which languages are likely to respond to adaptive techniques for
modelling a language starting with a known model for another
language.
● Similarity feature definition
– Geographic (areal contact)
– Typological (similarity of paradigmatic and syntagmatic structures)
– Genealogical (history of language families)
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 43/79
Enabling technologies
Annotation and Annotation Mining
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 44/79
Why annotation? And how?
● The primary reason for the annotation of text and speech
data is to enable efficient search procedures
– to provide structure
– in order to make data systematically searchable
– by assigning perceptual/hermeneutic categories to data
– for the purposes of
● finding archived media
● linguistic and phonetic analysis
● development of speech and language systems
● Searching unstructured data is difficult
– but improving with the help of machine learning
● example: search on free text data (e.g. Google search as a machine
for on-the-fly concordance construction from web data)
● example: Google’s image search
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 45/79
Why annotation? And how?
● The primary reason for the annotation of text and speech
data is to enable efficient search procedures
– to provide structure
– in order to make data systematically searchable
– by assigning perceptual/hermeneutic categories to data
– for the purposes of
● finding archived media
● linguistic and phonetic analysis
● development of speech and language systems
● Searching unstructured data is difficult
– but improving with the help of machine learning
● example: search on free text data (e.g. Google search as a machine
for on-the-fly concordance construction from web data)
● example: Google’s image search
ALL THESE PROCEDURES
ARE VERY PRECISELY DEFINABLE
AND THEREFORE CAN BE (AND ARE OFTEN)
AUTOMATISED WITH APPROPRIATE SOFTWARE
(POSTAGGERS, CONCORDANCERS,
AUTOMATIC SPEECH ANNOTATORS, ...)
THIS IS STANDARD PRACTICE
IN THE HUMAN LANGUAGE TECHNOLOGIES (HLT)
BUT NOT (YET) IN DOCUMENTARY LINGUISTICS
WHEN THIS HAPPENS, THE EFFICIENCY
OF MANY ASPECTS OF DL
WILL BE REVOLUTIONLSED
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 46/79
Speech annotation example: graphical display with Praat
(Niger-Congo>Gur>Tem, ISO 639-3 kdh)
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 47/79
Annotation mining: the syllable timing space
● A standard meme in speech timing descriptions is that the
spoken forms of languages fall into 3 categories:
– stress (or foot) timing
– syllable timing
– mora timing
● What kind of speech rhythm – 1/1, 2/4, 3/4, 4/4 ... ?
– Embarrassing problem: physical correlates are elusive
– Measures of relative regularity or irregularity of timing
relative to the syllable or the foot - global numbers, no
detailed information
– Example:
● normalised Pairwise Variability Index (nPVI):
● average duration differences between neighbouring items
● English: nPVI = 70 (mean=186ms, SD= 127ms, coeffvar=69%)
● Mandarin: nPVI = 40 (mean=174ms, SD= 59ms, coeffvar=34%)
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 48/79
Automatic Annotation Mining: looking at the details
(global descriptive statistics for British English recordings)
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 49/79
Automatic Annotation Mining: looking at the details
(KWIC concordance of text labels in context, with statistics)
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 50/79
Automatic Annotation Mining: looking at the details
(Wagner Quadrant Graphs for Mandarin, Tem and English)
Duration relations between
neighbouring syllables:
• durationn x durationn+1
• note the distribution:
• regular?
• random?
• clustering?
• position?
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 51/79
Automatic Annotation Mining: looking at the details
(rhythm spectrum graphs for Mandarin, Tem and English)
Numbers of duration maxima
(peaks) at different item
distances:
x = peak count
y = peak distance
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 52/79
Automatic Annotation Mining: looking at the details
(stylised duration patterns – the reality of rhythm)
Mandarin
Tem
Peak sequences at
distances 1,...,3
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 53/79
Automatic Annotation Mining: looking at the details
(stylised duration patterns – the reality of rhythm)
Mandarin
English
Peak sequences at
distances 1,...,3
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 54/79
Automatic Annotation mining: tone in Mandarin and Tem
Tem
(no difference)
Mandarin
(difference for Tone 5, neutral tone)
Significance: natural TTS, e.g. reading functionality for the blind, voice
information services (e.g. Apple Siri, Microsoft Cortana, Google Now)
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 55/79
Automatic Annotation mining: tone/stress in Mandarin, English
English
(gradient difference tendency)
Mandarin
(difference for Tone 5, neutral tone)
Significance: natural TTS, e.g. reading functionality for the blind, voice
information services (e.g. Apple Siri, Microsoft Cortana, Google Now)
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 56/79
Time Group Analysis
TGA online annotation mining tool
http://wwwhomes.uni-bielefeld.de/gibbon/TGA/
http://localhost/TGA/
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 57/79
Annotation Mining: rhythm visualisation
TGA online tool: visualisation of syllable time relations
http://wwwhomes.uni-bielefeld.de/gibbon/TGA/
Duration difference tokens:
Keyboard friendly transcription:
2D visualisation of durations:
Syllable durations:
Duration difference tokens for utterance #37:
– pos, neg, equal differences between neighbours: /, , =
– difference threshold: 40ms.
– clear indication of syllable isochrony
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 58/79
Annotation Mining: rhythm visualisation
Compare with English pattern
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 59/79
Enabling Technologies
Modelling tones: Niger-Congo > Gur > Tem (Togo)
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 60/79
Annotation Mining: tone automaton induction for TTS
Resources:
– Zakari Tchagbale: “Exercices et corrigés” (1984)
– Dafydd Gibbon: finite state model of tone sandhi (1987)
– Zakari Tchagbale: recording (2012)
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 61/79
Annotation Mining: tone automaton induction for TTS
Automatic
downstep
Startup
effect
Automatic
upstep
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 62/79
Annotation Mining: tone automaton induction for TTS
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 63/79
Annotation Mining: tone automaton induction for TTS
Tone co-occurrences
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 64/79
Case 2: Allotones: Tem (Gur; ISO 639-2 kth) resources
H; #h; high
startup
H
semi-
terrace
L
semi-
terrace
end
start
L; #l; low
startup
L; l; downdriftH; h; upsweep L; ^l; upstep
H; h#;
>baseline
L; l#;
baseline
H; !h; downstep
end
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 65/79
Case 2: Allotones: Tem (Gur; ISO 639-2 kth) resources
H; #h; high
startup
H
semi-
terrace
L
semi-
terrace
end
start
L; #l; low
startup
L; l; downdriftH; h; upsweep L; ^l; upstep
H; h#;
>baseline
L; l#;
baseline
H; !h; downstep
end
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 66/79
Enabling technologies
Language similarity and difference
as virtual distance
Towards the use of Machine Learning
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 67/79
Enabling technologies: prioritisation
● Importance of determining language similarities and
dissimilarities:
– re-use of existing materials in different communities
● beware of problems (e.g. Ibibio / Efik)
– adaptation of existing productivity technologies
– prioritisation of documentation funding
– ...
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 68/79
How similar are languages? - A little similar, but not very similar
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 69/79
Selecting features for similarity distances
● Lexical
– Swadesh word list
– West African Lexical Dictionary Set (WALDS)
– …
● Phonetic / phonological
– vowel set comparison
– consonant set comparison (more stable – used for many
traditional initial classifications, e.g. of Indo-European
languages)
● Grammatical features
– World Atlas of Language Structures (WALS)
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 70/79
Kru languages – Ivory Coast: Method – consonant systems
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 71/79
Kru languages – Ivory Coast: Method – consonant systems
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 72/79
Kru languages – Ivory Coast: Feature – consonant systems
Method:
1. compare all consonant systems pairwise:
Levenshtein Edit Distance / Hamming Distance
2. visualise differences as ‘virtual distances’ in a chart
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 73/79
Kru languages – Ivory Coast: Feature – consonant systems
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 74/79
Kru languages – Ivory Coast: Feature – consonant systems
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 75/79
Which features are most useful? - Help from Machine Learning
(Decision Tree Induction)
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 76/79
Conclusion
What do we lose if we lose languages?
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 77/79
What do we lose if we lose languages?
● Why is this an issue? What danger?!
– Imagine what we lose if a language disappears...
– Why not just use one language worldwide?
– Remember what language is for!
● Imagine what we lose if a language disappears:
– Structure: understanding of the architecture of the human mind
– each language has a unique combination of cognitive patterns
– Language semantics: a language is a repository of knowledge
about the world – skills, health
– Pragmatics: narratives of history and identity of a language
community, law, literature (with music and art), religion, ...
● Why not just use one language worldwide?
– Lack of interregional communication means that the shared
language will develop local differences anyway ...
– It’s boring ...
–
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 78/79
Language is complex: Ranks, Interpretations, Languages, Context
(MORPHO)PHONEME
MORPHEME
LEXICAL ROOT
DERIVED WORD
COMPOUND WORD
PHRASE
CLAUSE
SENTENCE
TEXT
LEXICON–partialregularity,holisticopacity
DIALOGUE
Syntagmatic properties
Hypostatic
properties
in different
modalities:
speech
text
gesture
Paradigmatic
properties
Grammar – compositionality
PROSODIC
HIERARCHIES
structural opacity
SEMANTICS/PRAGMATICS
CONCEPTS, OBJECTS, EVENTS
structural opacity
ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 79/79
Diolch yn fawr!

More Related Content

What's hot

Developing corpus-based resources for language learning: looking back in "hope"
Developing corpus-based resources for language learning: looking back in "hope"Developing corpus-based resources for language learning: looking back in "hope"
Developing corpus-based resources for language learning: looking back in "hope"Pascual Pérez-Paredes
 
Role of language engineering to preserve endangered languages
Role of language engineering to preserve endangered languagesRole of language engineering to preserve endangered languages
Role of language engineering to preserve endangered languagesDr. Amit Kumar Jha
 
Role of Language Engineering to Preserve Endangered Language
Role of Language Engineering to Preserve Endangered Language Role of Language Engineering to Preserve Endangered Language
Role of Language Engineering to Preserve Endangered Language Dr. Amit Kumar Jha
 
Computational linguistics
Computational linguistics Computational linguistics
Computational linguistics kashmasardar
 
Research data as an aid in teaching technical competence in subtitling
Research data as an aid in teaching technical competence in subtitlingResearch data as an aid in teaching technical competence in subtitling
Research data as an aid in teaching technical competence in subtitlingUniversity of Warsaw
 
Glis Localization Internationalization 05 20071030
Glis Localization Internationalization 05 20071030Glis Localization Internationalization 05 20071030
Glis Localization Internationalization 05 20071030Jan Pawlowski
 
1 computational linguistics an introduction
1 computational linguistics   an introduction1 computational linguistics   an introduction
1 computational linguistics an introductionThennarasuSakkan
 
Computational linguistics
Computational linguisticsComputational linguistics
Computational linguisticsshrey bhate
 
Tracking Learning: Using Corpus Linguistics to Assess Language Development
Tracking Learning: Using Corpus Linguistics to Assess Language DevelopmentTracking Learning: Using Corpus Linguistics to Assess Language Development
Tracking Learning: Using Corpus Linguistics to Assess Language DevelopmentCALPER
 
Technological approaches to linguistic documentation and meta-documentation
Technological approaches to linguistic documentation and meta-documentationTechnological approaches to linguistic documentation and meta-documentation
Technological approaches to linguistic documentation and meta-documentationPankaj Dwivedi
 
Developing Teaching Materials with Authentic Data and Corpus Analysis Tools
Developing Teaching Materials with Authentic Data and Corpus Analysis ToolsDeveloping Teaching Materials with Authentic Data and Corpus Analysis Tools
Developing Teaching Materials with Authentic Data and Corpus Analysis ToolsCALPER
 
Script to Sentiment : on future of Language TechnologyMysore latest
Script to Sentiment : on future of Language TechnologyMysore latestScript to Sentiment : on future of Language TechnologyMysore latest
Script to Sentiment : on future of Language TechnologyMysore latestJaganadh Gopinadhan
 
Design and Implementation of a Language Assistant for English – Arabic Texts
Design and Implementation of a Language Assistant for English – Arabic TextsDesign and Implementation of a Language Assistant for English – Arabic Texts
Design and Implementation of a Language Assistant for English – Arabic TextsIJCSIS Research Publications
 
Computational linguistics
Computational linguisticsComputational linguistics
Computational linguisticsAdnanBaloch15
 
Computational linguistics
Computational linguisticsComputational linguistics
Computational linguistics1101989
 
Learning through Listening towards Advanced Japanese
Learning through Listening towards Advanced JapaneseLearning through Listening towards Advanced Japanese
Learning through Listening towards Advanced JapaneseCALPER
 
Sanskrit and Computational Linguistic
Sanskrit and Computational Linguistic Sanskrit and Computational Linguistic
Sanskrit and Computational Linguistic Jaganadh Gopinadhan
 
Technology in the ASL/English Bilingual Classroom
Technology in the ASL/English Bilingual ClassroomTechnology in the ASL/English Bilingual Classroom
Technology in the ASL/English Bilingual Classroomrosemary.stifter
 

What's hot (20)

Developing corpus-based resources for language learning: looking back in "hope"
Developing corpus-based resources for language learning: looking back in "hope"Developing corpus-based resources for language learning: looking back in "hope"
Developing corpus-based resources for language learning: looking back in "hope"
 
Role of language engineering to preserve endangered languages
Role of language engineering to preserve endangered languagesRole of language engineering to preserve endangered languages
Role of language engineering to preserve endangered languages
 
Role of Language Engineering to Preserve Endangered Language
Role of Language Engineering to Preserve Endangered Language Role of Language Engineering to Preserve Endangered Language
Role of Language Engineering to Preserve Endangered Language
 
Computational linguistics
Computational linguistics Computational linguistics
Computational linguistics
 
The Dual Language Professional NABE 2006
The Dual Language Professional NABE 2006The Dual Language Professional NABE 2006
The Dual Language Professional NABE 2006
 
Research data as an aid in teaching technical competence in subtitling
Research data as an aid in teaching technical competence in subtitlingResearch data as an aid in teaching technical competence in subtitling
Research data as an aid in teaching technical competence in subtitling
 
Glis Localization Internationalization 05 20071030
Glis Localization Internationalization 05 20071030Glis Localization Internationalization 05 20071030
Glis Localization Internationalization 05 20071030
 
1 computational linguistics an introduction
1 computational linguistics   an introduction1 computational linguistics   an introduction
1 computational linguistics an introduction
 
Computational linguistics
Computational linguisticsComputational linguistics
Computational linguistics
 
Tracking Learning: Using Corpus Linguistics to Assess Language Development
Tracking Learning: Using Corpus Linguistics to Assess Language DevelopmentTracking Learning: Using Corpus Linguistics to Assess Language Development
Tracking Learning: Using Corpus Linguistics to Assess Language Development
 
Technological approaches to linguistic documentation and meta-documentation
Technological approaches to linguistic documentation and meta-documentationTechnological approaches to linguistic documentation and meta-documentation
Technological approaches to linguistic documentation and meta-documentation
 
Developing Teaching Materials with Authentic Data and Corpus Analysis Tools
Developing Teaching Materials with Authentic Data and Corpus Analysis ToolsDeveloping Teaching Materials with Authentic Data and Corpus Analysis Tools
Developing Teaching Materials with Authentic Data and Corpus Analysis Tools
 
Script to Sentiment : on future of Language TechnologyMysore latest
Script to Sentiment : on future of Language TechnologyMysore latestScript to Sentiment : on future of Language TechnologyMysore latest
Script to Sentiment : on future of Language TechnologyMysore latest
 
Design and Implementation of a Language Assistant for English – Arabic Texts
Design and Implementation of a Language Assistant for English – Arabic TextsDesign and Implementation of a Language Assistant for English – Arabic Texts
Design and Implementation of a Language Assistant for English – Arabic Texts
 
Com ling
Com lingCom ling
Com ling
 
Computational linguistics
Computational linguisticsComputational linguistics
Computational linguistics
 
Computational linguistics
Computational linguisticsComputational linguistics
Computational linguistics
 
Learning through Listening towards Advanced Japanese
Learning through Listening towards Advanced JapaneseLearning through Listening towards Advanced Japanese
Learning through Listening towards Advanced Japanese
 
Sanskrit and Computational Linguistic
Sanskrit and Computational Linguistic Sanskrit and Computational Linguistic
Sanskrit and Computational Linguistic
 
Technology in the ASL/English Bilingual Classroom
Technology in the ASL/English Bilingual ClassroomTechnology in the ASL/English Bilingual Classroom
Technology in the ASL/English Bilingual Classroom
 

Similar to ELKL 4, Language Technology: learning from endangered languages

Huy & robert a call to call
Huy & robert a call to callHuy & robert a call to call
Huy & robert a call to callPhung Huy
 
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...Europeana
 
Elin005 st
Elin005 stElin005 st
Elin005 stirinae
 
Human Language Technologies for Ethiopian Languages: Challenges and Future Di...
Human Language Technologies for Ethiopian Languages: Challenges and Future Di...Human Language Technologies for Ethiopian Languages: Challenges and Future Di...
Human Language Technologies for Ethiopian Languages: Challenges and Future Di...Guy De Pauw
 
Applied Linguistics session 111 0_07_12_2021 Applied linguistics challenges.pdf
Applied Linguistics session 111 0_07_12_2021 Applied linguistics challenges.pdfApplied Linguistics session 111 0_07_12_2021 Applied linguistics challenges.pdf
Applied Linguistics session 111 0_07_12_2021 Applied linguistics challenges.pdfDr.Badriya Al Mamari
 
Plurilingualism and intercomprehension teaching forum 2012
Plurilingualism and intercomprehension teaching forum 2012Plurilingualism and intercomprehension teaching forum 2012
Plurilingualism and intercomprehension teaching forum 2012Tita Beaven
 
Promoting the Use of Basque via Language Technology
Promoting the Use of Basque via Language TechnologyPromoting the Use of Basque via Language Technology
Promoting the Use of Basque via Language Technologytechiaith
 
Natural language processing for Albanian: a state-of-the-art survey
Natural language processing for Albanian: a state-of-the-art  surveyNatural language processing for Albanian: a state-of-the-art  survey
Natural language processing for Albanian: a state-of-the-art surveyIJECEIAES
 
Reflections on building a Multi-country AAC Implementation Guide.pptx
Reflections on building a Multi-country AAC Implementation Guide.pptxReflections on building a Multi-country AAC Implementation Guide.pptx
Reflections on building a Multi-country AAC Implementation Guide.pptxE.A. Draffan
 
What to Consider for Effective Mobile-Assisted Language Learning: Design Impl...
What to Consider for Effective Mobile-Assisted Language Learning: Design Impl...What to Consider for Effective Mobile-Assisted Language Learning: Design Impl...
What to Consider for Effective Mobile-Assisted Language Learning: Design Impl...heyoungkim
 
CALL (computer Assisted Language)
CALL (computer Assisted Language)CALL (computer Assisted Language)
CALL (computer Assisted Language)syeda12345
 
Calico 2014 intelligent call - def
Calico 2014   intelligent call - defCalico 2014   intelligent call - def
Calico 2014 intelligent call - defPiet Desmet
 
Towards Language-Oriented Modeling (HDR Defense)
Towards Language-Oriented Modeling (HDR Defense)Towards Language-Oriented Modeling (HDR Defense)
Towards Language-Oriented Modeling (HDR Defense)Benoit Combemale
 
universityofcaldas-languagelearningresources-131106095614-phpapp02.pptx
universityofcaldas-languagelearningresources-131106095614-phpapp02.pptxuniversityofcaldas-languagelearningresources-131106095614-phpapp02.pptx
universityofcaldas-languagelearningresources-131106095614-phpapp02.pptxNabaeghaNajam1
 
Exploring Smartphone Applications for Effective Mobile-Assisted Language Lear...
Exploring Smartphone Applications for Effective Mobile-Assisted Language Lear...Exploring Smartphone Applications for Effective Mobile-Assisted Language Lear...
Exploring Smartphone Applications for Effective Mobile-Assisted Language Lear...heyoungkim
 
[DSC Europe 23] Slobodan Markovic - NLP for Serbian.pptx
[DSC Europe 23] Slobodan Markovic - NLP for Serbian.pptx[DSC Europe 23] Slobodan Markovic - NLP for Serbian.pptx
[DSC Europe 23] Slobodan Markovic - NLP for Serbian.pptxDataScienceConferenc1
 
Analysing the forum discussion
Analysing the forum discussionAnalysing the forum discussion
Analysing the forum discussionJimmy Castro
 

Similar to ELKL 4, Language Technology: learning from endangered languages (20)

Huy & robert a call to call
Huy & robert a call to callHuy & robert a call to call
Huy & robert a call to call
 
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2...
 
Elin005 st
Elin005 stElin005 st
Elin005 st
 
Human Language Technologies for Ethiopian Languages: Challenges and Future Di...
Human Language Technologies for Ethiopian Languages: Challenges and Future Di...Human Language Technologies for Ethiopian Languages: Challenges and Future Di...
Human Language Technologies for Ethiopian Languages: Challenges and Future Di...
 
Applied Linguistics session 111 0_07_12_2021 Applied linguistics challenges.pdf
Applied Linguistics session 111 0_07_12_2021 Applied linguistics challenges.pdfApplied Linguistics session 111 0_07_12_2021 Applied linguistics challenges.pdf
Applied Linguistics session 111 0_07_12_2021 Applied linguistics challenges.pdf
 
1.pdf
1.pdf1.pdf
1.pdf
 
Plurilingualism and intercomprehension teaching forum 2012
Plurilingualism and intercomprehension teaching forum 2012Plurilingualism and intercomprehension teaching forum 2012
Plurilingualism and intercomprehension teaching forum 2012
 
Promoting the Use of Basque via Language Technology
Promoting the Use of Basque via Language TechnologyPromoting the Use of Basque via Language Technology
Promoting the Use of Basque via Language Technology
 
Natural language processing for Albanian: a state-of-the-art survey
Natural language processing for Albanian: a state-of-the-art  surveyNatural language processing for Albanian: a state-of-the-art  survey
Natural language processing for Albanian: a state-of-the-art survey
 
Reflections on building a Multi-country AAC Implementation Guide.pptx
Reflections on building a Multi-country AAC Implementation Guide.pptxReflections on building a Multi-country AAC Implementation Guide.pptx
Reflections on building a Multi-country AAC Implementation Guide.pptx
 
What to Consider for Effective Mobile-Assisted Language Learning: Design Impl...
What to Consider for Effective Mobile-Assisted Language Learning: Design Impl...What to Consider for Effective Mobile-Assisted Language Learning: Design Impl...
What to Consider for Effective Mobile-Assisted Language Learning: Design Impl...
 
CALL (computer Assisted Language)
CALL (computer Assisted Language)CALL (computer Assisted Language)
CALL (computer Assisted Language)
 
Calico 2014 intelligent call - def
Calico 2014   intelligent call - defCalico 2014   intelligent call - def
Calico 2014 intelligent call - def
 
Towards Language-Oriented Modeling (HDR Defense)
Towards Language-Oriented Modeling (HDR Defense)Towards Language-Oriented Modeling (HDR Defense)
Towards Language-Oriented Modeling (HDR Defense)
 
universityofcaldas-languagelearningresources-131106095614-phpapp02.pptx
universityofcaldas-languagelearningresources-131106095614-phpapp02.pptxuniversityofcaldas-languagelearningresources-131106095614-phpapp02.pptx
universityofcaldas-languagelearningresources-131106095614-phpapp02.pptx
 
Exploring Smartphone Applications for Effective Mobile-Assisted Language Lear...
Exploring Smartphone Applications for Effective Mobile-Assisted Language Lear...Exploring Smartphone Applications for Effective Mobile-Assisted Language Lear...
Exploring Smartphone Applications for Effective Mobile-Assisted Language Lear...
 
[DSC Europe 23] Slobodan Markovic - NLP for Serbian.pptx
[DSC Europe 23] Slobodan Markovic - NLP for Serbian.pptx[DSC Europe 23] Slobodan Markovic - NLP for Serbian.pptx
[DSC Europe 23] Slobodan Markovic - NLP for Serbian.pptx
 
Analysing the forum discussion
Analysing the forum discussionAnalysing the forum discussion
Analysing the forum discussion
 
600Desc
600Desc600Desc
600Desc
 
600Desc
600Desc600Desc
600Desc
 

Recently uploaded

Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxpradhanghanshyam7136
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physicsvishikhakeshava1
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 sciencefloriejanemacaya1
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRDelhi Call girls
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptMAESTRELLAMesa2
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 

Recently uploaded (20)

Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptx
 
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physics
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 science
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.ppt
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 

ELKL 4, Language Technology: learning from endangered languages

  • 1. Language Technologies What endangered languages can teach us Dafydd Gibbon U Bielefeld ELKL-4, Agra, 2016-02-
  • 2. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 2/79 Focus: role reversal Not just: What can the Human Language Technologies offer to endangered (and other) languages? But: What can Human Language Engineering learn from endangered languages? Or: What do endangered languages require from the Human Language Technologies?
  • 3. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 3/79 Applying technologies – a reminder A traditional ‘waterfall model’ of software development Requirements specification: systems analysis Product dissemination: Reliability, sustainability, interoperability, updating Implementation: Programming methods, styles Design: Modules, algorithms, data structures Evaluation: Black box, glass box, field testing
  • 4. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 4/79 Applying technologies – a reminder Requirements specification: systems analysis Product dissemination: Reliability, sustainability, interoperability, updating Implementation: Programming methods, styles Design: Modules, algorithms, data structures Evaluation: Black box, glass box, field testing Feedback and revision cycle
  • 5. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 5/79 Summary ● Reversed roles: – What can endangered languages teach lang tech? ● Background to Language Endangerment ● Technology roles: – Documentary technologies – Enabling technologies – Productivity technologies ● Enabling technologies – Annotation and annotation mining ● How do languages differ in speech rhythm? – How similar / different are languages? ● Language differences/distances among languages of Ivory Coast ● Endangered languages: – What do science and technology (human knowledge in general) lose when languages die?
  • 6. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 6/79 Language Technologies ● Engineering: – speech: ASR, TTS, speaker id / recognition, ... – language: NLP, NL parsing, Q&A, text mining, text classification, lexicon and grammar induction, machine translation … – multimodal: speech I/0 (dictation, process control, speech computer UI), speech avatars (Siri, Cortana), gesture (touchpad, waving), biometric systems ● Computational linguistics & Computer Science: – domain models of natural language syntax, semantics, pragmatics, language typology and genesis – formalisms and algorithms for induction, parsing, generation of language – corpus analysis for lexicon and grammar induction
  • 7. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 7/79 Language Technologies ● Engineering: – speech: ASR, TTS, speaker id / recognition, ... – language: NLP, NL parsing, Q&A, text mining, text classification, lexicon and grammar induction, machine translation … – multimodal: speech I/0 (dictation, process control, speech computer UI), speech avatars (Siri, Cortana), gesture (touchpad, waving), biometric systems ● Computational linguistics & Computer Science: – domain models of natural language syntax, semantics, pragmatics, language typology and genesis – formalisms and algorithms for induction, parsing, generation of language – corpus analysis for lexicon and grammar induction OVERLAP (also with linguistics)
  • 8. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 8/79 The broader interdisciplinary context
  • 9. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 9/79 Doc Ling and Digital Humanities Humanities: Theology, Philology, Literary Studies, Linguistics, Law, ...Computation Computational Linguistics Mathematical Computational Linguistics NLP, Computational Corpus Linguistics Digital Humanities Speech Technologies Documentary Linguistics ? Timeline 1950 Humanities Computing
  • 10. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 10/79 Doc Ling and Digital Humanities Humanities: Theology, Philology, Literary Studies, Linguistics, Law, ...Computation Computational Linguistics Mathematical Computational Linguistics NLP, Computational Corpus Linguistics Digital Humanities Speech Technologies Documentary Linguistics ? Timeline 1950 Humanities ComputingSGML TEI (HTML) XML
  • 11. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 11/79 Roles for computational technologies 1. Documentation technologies 2. Enabling technologies 3. Productivity technologies
  • 12. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 12/79 1. Documentation Technologies ● Project planning tools ● Data collection tools – scenario support for ● elicitation ● recording (multimodal) ● metadata collection – document scanning ● Data archiving and access – database and search model ● relational, object-oriented, ... ● Multilinear annotation – for search, re-use analysis, application: ● sharable (sustainable, interoperable) standards ● annotation categories for phonetics, grammar, discourse, … ● semi-automatic annotation methods
  • 13. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 13/79 2. Enabling Technologies ● Resource construction tools ● phonetic analysis ● lexicon induction from data – word lists – word frequency lists (and other word statistics) – concordances – collocations ● grammar induction from data – Part of Speech (POS) tagging – grammar induction – parsing and generation ● translation – multilingual dictionaries – terminologies – processing of parallel or comparable texts – translator’s workbench
  • 14. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 14/79 3. Productivity Technologies ● Recognition – Automatic Speech Recognition (ASR) – Visual scene and object recognition – Information retrieval from text ● Identification – Speaker identification – Language identfication – Authorship attribution ● Generation – Text-to-Speech Synthesis (TTS) – Written text geneneration from databases ● Products – Dictation and information applications – Translation applications
  • 15. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 15/79 Endangered languages as teachers: some topics
  • 16. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 16/79 Endangered languages as teachers: some topics Insights into and results concerning – diversity vs. normalization of speech and language – the intricacy, complexity of speech, language and languages – similarities and differences between languages (typology, history, dispersion of language) – scenario-dependent properties of languages – gender, age, social role – task orientation – public vs. informal vs. intimate styles – diversity of expression of emotion – political status of languages wrt dominance, minorities
  • 17. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 17/79 Endangered languages as teachers: some outlets ● Speech Assessment Methodologies (EC Project) ● EAGLES: Expert Advisory Groups for Language Engineering Standards ● Many other resources oriented European Projects, including – MATE – IMDI – ... ● LREC – Language Resources and Evaluation Conference ● Language Resources Map – https://en.wikipedia.org/wiki/LRE_Map ● Krauwer’s BLARK: Basic Language Resource Kit
  • 18. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 18/79 Endangered languages as teachers: some models ● Steven Krauwer’s BLARK: – Goal of equal status of European languages – Generalisable to the world at large? – Basic Language Resource Kit initial specification ● written language corpora ● spoken language corpora ● mono- and bilingual dictionaries ● terminology collections ● grammars ● modules (e.g. taggers, morphological analysers, parsers, speech recognisers, text-to-speech) ● annotation standards and tools ● corpus exploration and exploitation tools ● bilingual corpora ● etc
  • 19. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 19/79 Some models: Basic Language Resource Kit (BLARK)
  • 20. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 20/79 Some models: Basic Language Resource Kit (BLARK)
  • 21. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 21/79 Many more topics … ● A closer look at language typology reveals many more requirements for the technogies, for example: – concurrent data and parallel processing: ● phonological tone ● morphological tone vs. pitch accent vs. stress; prosodic negation ● grammatical stress/accent, focus, intonation, word order ● gesture, sign language, discourse participants – language similarity for priority setting in language selection for funding (usually: chance) and for system adaptation: ● role of language typology: – similarity/difference in speech sound systems – similarity/difference in grammar – similarity/difference in the lexicon ● similarity/difference in discourse conventions ● similarity/difference in general cultural conventions
  • 22. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 22/79 Comparison of DocLing and LangTech scenarios ● The documentary linguistic scenarios are: – rather individual, extremely heterogeneous – rather hard to define and delimit – de facto standards: Praat, ELAN, Wordsmith, TypeCraft,... – somewhat ad hoc - ‘what you can get’ ● Language, speech, multimodal technology scenarios are: – highly standardised, rather coherent – tendentially easy to define and delimit – very application / product oriented ● especially in speech technology: highly product specific ● text technology is more generic – regulated standards: ● statistical evaluation procedures ● institutional standards (e.g. ISO)
  • 23. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 23/79 Relations between the technologies: a simplified ‘workflow’
  • 24. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 24/79 Reversed roles: what endangered languages offer to engineering Endangered (and other) language scenarios
  • 25. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 25/79 Reversed roles: what endangered languages offer to engineering Endangered (and other) language scenarios Formal and computational models Interview, corpus, experimental phonetic and psycholinguistic methods
  • 26. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 26/79 Reversed roles: what endangered languages offer to engineering Modern software supported descriptive and experimental fieldwork Automatic analysis & machine learning Endangered (and other) language scenarios Formal and computational models Interview, corpus, experimental phonetic and psycholinguistic methods Lexicons Grammars Text grammars Discourse models
  • 27. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 27/79 Reversed roles: what endangered languages offer to engineering Modern software supported descriptive and experimental fieldwork Automatic analysis & machine learning Endangered (and other) language scenarios Formal and computational models Interview, corpus, experimental phonetic and psycholinguistic methods Lexicons Grammars Text grammars Discourse models Repositories MPI (DoBeS) SOAS (HRELP) PARADISEC LDC ELRA/ELDA … … ...
  • 28. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 28/79 Reversed roles: what endangered languages offer to engineering Applications Literature Media Teaching HLT Modern software supported descriptive and experimental fieldwork Automatic analysis & machine learning Endangered (and other) language scenarios Formal and computational models Interview, corpus, experimental phonetic and psycholinguistic methods Lexicons Grammars Text grammars Discourse models Repositories MPI (DoBeS) SOAS (HRELP) PARADISEC LDC ELRA/ELDA
  • 29. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 29/79 What I will be looking at – a modest selection Modern software supported descriptive and experimental fieldwork Automatic analysis & machine learning Endangered (and other) language scenarios Formal and computational models Interview, corpus, experimental phonetic and psycholinguistic methods Prosodic Analysis Tone Intonation Timing Rhythm
  • 30. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 30/79 What I will be looking at – a modest selection Modern software supported descriptive and experimental fieldwork Automatic analysis & machine learning Endangered (and other) language scenarios Formal and computational models Interview, corpus, experimental phonetic and psycholinguistic methods Prosodic Analysis Tone Intonation Timing Rhythm
  • 31. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 31/79 Background to Language Endangerment: A Global Perspective
  • 32. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 32/79 Ethnologue metadata repository of all known languages
  • 33. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 33/79 Background: Global Perspective on language endangerment ● Globalization – travel, trade, economy, politics → language dominance Trade languages, pidgins: Greek → Latin → Arabic → Portuguese → Spanish → French → English → religion, culture – crises ● Example: climate deterioration (Africa, West Asia) → lack of water → famine → poverty → hunger → disease → migration → diaspora ● “Clash of Civilizations” (Samuel Huntington) → conflicts: both local and global ● The role of language stereotypes, discrimination?
  • 34. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 34/79 Background: Global Perspective on language endangerment ● Globalization – travel, trade, economy, politics → language dominance Trade languages, pidgins: Greek → Latin → Arabic → Portuguese → Spanish → French → English → religion, culture – crises ● Example: climate deterioration (Africa, West Asia) → lack of water → famine → poverty → hunger → disease → migration ● “Clash of Civilizations” (Samuel Huntington) → conflicts: both local and global ● The role of language stereotypes, discrimination? The ‘explanation’ for language endangerment as LOSS OF INTERGENERATIONAL TRANSMISSON OF LANGUAGES, CULTURE, RELIGION, SKILLS, … is too simple. It begs the question: Why is there loss of intergenerational transmission?
  • 35. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 35/79 Clash of Civilizations? World religions How far is technology relevant?
  • 36. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 36/79 Clash of Civilizations? World language families How far can technology go?
  • 37. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 37/79 ISO 639-3 Language Code Standard – 5 language types ● Living languages: A language is listed as living when there are people still living who learned it as a first language. Based on intelligibility not politics. ● Extinct languages: no longer living, gone extinct in recent times. (e.g. in the last few centuries), distinct languages identified based on intelligibility (as defined for individual languages). ● Ancient languages: If it went extinct in ancient times (> 1000 years), must have had a distinct literature and be treated distinctly by the scholarly community, must have an attested literature or be well-documented as a language known to have been spoken by some particular community at some point in history, may not be areconstructed language inferred from historical-comparative analysis. ● Historic languages: Must be distinct from any modern languages that are descended from it; for instance, Old English and Middle English. Here, too, the criterion is that the language have a literature that is treated distinctly by the scholarly community. ● Constructed languages: Must have a literature and it must be designed for the purpose of human communication. Specifically excluded are reconstructed languages and computer programming languages.
  • 38. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 38/79 UNESCO chart of endangered languages
  • 39. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 39/79 UNESCO map of endangered languages
  • 40. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 40/79 Endangered languages - ‘in danger of disappearing’ ● Endangered languages: – Languages in danger of disappearing – UNESCO – The language birth and death cycle ● Why and how do new languages appear / disappear? ● Why ‘danger’? – The economics of language endangerment – The politics of language endangerment – Social consequences of language endangerment ● What can the Human Language Technologies do? – Documentation Technologies? – Enabling Technologies? – Productivity Technologies? ● Which languages can be handled by HLT?
  • 41. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 41/79 Enabling Technologies Annotation and Annotation Mining Language Similarity Analysis
  • 42. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 42/79 Enabling technologies ● Annotation (preferably automatic) – associating text labels and time-stamps with speech recordings ● Annotation mining (preferably automatic) – information extraction from annotations: ● text label list, text label frequencies, text label duration statistics ● visualisation of text label duration patterns, rhythm patterns ● Similarity analysis – virtual distance mapping – Which languages have been (almost) documented? – Which languages are likely to respond to adaptive techniques for modelling a language starting with a known model for another language. ● Similarity feature definition – Geographic (areal contact) – Typological (similarity of paradigmatic and syntagmatic structures) – Genealogical (history of language families)
  • 43. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 43/79 Enabling technologies Annotation and Annotation Mining
  • 44. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 44/79 Why annotation? And how? ● The primary reason for the annotation of text and speech data is to enable efficient search procedures – to provide structure – in order to make data systematically searchable – by assigning perceptual/hermeneutic categories to data – for the purposes of ● finding archived media ● linguistic and phonetic analysis ● development of speech and language systems ● Searching unstructured data is difficult – but improving with the help of machine learning ● example: search on free text data (e.g. Google search as a machine for on-the-fly concordance construction from web data) ● example: Google’s image search
  • 45. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 45/79 Why annotation? And how? ● The primary reason for the annotation of text and speech data is to enable efficient search procedures – to provide structure – in order to make data systematically searchable – by assigning perceptual/hermeneutic categories to data – for the purposes of ● finding archived media ● linguistic and phonetic analysis ● development of speech and language systems ● Searching unstructured data is difficult – but improving with the help of machine learning ● example: search on free text data (e.g. Google search as a machine for on-the-fly concordance construction from web data) ● example: Google’s image search ALL THESE PROCEDURES ARE VERY PRECISELY DEFINABLE AND THEREFORE CAN BE (AND ARE OFTEN) AUTOMATISED WITH APPROPRIATE SOFTWARE (POSTAGGERS, CONCORDANCERS, AUTOMATIC SPEECH ANNOTATORS, ...) THIS IS STANDARD PRACTICE IN THE HUMAN LANGUAGE TECHNOLOGIES (HLT) BUT NOT (YET) IN DOCUMENTARY LINGUISTICS WHEN THIS HAPPENS, THE EFFICIENCY OF MANY ASPECTS OF DL WILL BE REVOLUTIONLSED
  • 46. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 46/79 Speech annotation example: graphical display with Praat (Niger-Congo>Gur>Tem, ISO 639-3 kdh)
  • 47. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 47/79 Annotation mining: the syllable timing space ● A standard meme in speech timing descriptions is that the spoken forms of languages fall into 3 categories: – stress (or foot) timing – syllable timing – mora timing ● What kind of speech rhythm – 1/1, 2/4, 3/4, 4/4 ... ? – Embarrassing problem: physical correlates are elusive – Measures of relative regularity or irregularity of timing relative to the syllable or the foot - global numbers, no detailed information – Example: ● normalised Pairwise Variability Index (nPVI): ● average duration differences between neighbouring items ● English: nPVI = 70 (mean=186ms, SD= 127ms, coeffvar=69%) ● Mandarin: nPVI = 40 (mean=174ms, SD= 59ms, coeffvar=34%)
  • 48. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 48/79 Automatic Annotation Mining: looking at the details (global descriptive statistics for British English recordings)
  • 49. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 49/79 Automatic Annotation Mining: looking at the details (KWIC concordance of text labels in context, with statistics)
  • 50. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 50/79 Automatic Annotation Mining: looking at the details (Wagner Quadrant Graphs for Mandarin, Tem and English) Duration relations between neighbouring syllables: • durationn x durationn+1 • note the distribution: • regular? • random? • clustering? • position?
  • 51. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 51/79 Automatic Annotation Mining: looking at the details (rhythm spectrum graphs for Mandarin, Tem and English) Numbers of duration maxima (peaks) at different item distances: x = peak count y = peak distance
  • 52. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 52/79 Automatic Annotation Mining: looking at the details (stylised duration patterns – the reality of rhythm) Mandarin Tem Peak sequences at distances 1,...,3
  • 53. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 53/79 Automatic Annotation Mining: looking at the details (stylised duration patterns – the reality of rhythm) Mandarin English Peak sequences at distances 1,...,3
  • 54. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 54/79 Automatic Annotation mining: tone in Mandarin and Tem Tem (no difference) Mandarin (difference for Tone 5, neutral tone) Significance: natural TTS, e.g. reading functionality for the blind, voice information services (e.g. Apple Siri, Microsoft Cortana, Google Now)
  • 55. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 55/79 Automatic Annotation mining: tone/stress in Mandarin, English English (gradient difference tendency) Mandarin (difference for Tone 5, neutral tone) Significance: natural TTS, e.g. reading functionality for the blind, voice information services (e.g. Apple Siri, Microsoft Cortana, Google Now)
  • 56. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 56/79 Time Group Analysis TGA online annotation mining tool http://wwwhomes.uni-bielefeld.de/gibbon/TGA/ http://localhost/TGA/
  • 57. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 57/79 Annotation Mining: rhythm visualisation TGA online tool: visualisation of syllable time relations http://wwwhomes.uni-bielefeld.de/gibbon/TGA/ Duration difference tokens: Keyboard friendly transcription: 2D visualisation of durations: Syllable durations: Duration difference tokens for utterance #37: – pos, neg, equal differences between neighbours: /, , = – difference threshold: 40ms. – clear indication of syllable isochrony
  • 58. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 58/79 Annotation Mining: rhythm visualisation Compare with English pattern
  • 59. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 59/79 Enabling Technologies Modelling tones: Niger-Congo > Gur > Tem (Togo)
  • 60. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 60/79 Annotation Mining: tone automaton induction for TTS Resources: – Zakari Tchagbale: “Exercices et corrigés” (1984) – Dafydd Gibbon: finite state model of tone sandhi (1987) – Zakari Tchagbale: recording (2012)
  • 61. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 61/79 Annotation Mining: tone automaton induction for TTS Automatic downstep Startup effect Automatic upstep
  • 62. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 62/79 Annotation Mining: tone automaton induction for TTS
  • 63. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 63/79 Annotation Mining: tone automaton induction for TTS Tone co-occurrences
  • 64. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 64/79 Case 2: Allotones: Tem (Gur; ISO 639-2 kth) resources H; #h; high startup H semi- terrace L semi- terrace end start L; #l; low startup L; l; downdriftH; h; upsweep L; ^l; upstep H; h#; >baseline L; l#; baseline H; !h; downstep end
  • 65. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 65/79 Case 2: Allotones: Tem (Gur; ISO 639-2 kth) resources H; #h; high startup H semi- terrace L semi- terrace end start L; #l; low startup L; l; downdriftH; h; upsweep L; ^l; upstep H; h#; >baseline L; l#; baseline H; !h; downstep end
  • 66. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 66/79 Enabling technologies Language similarity and difference as virtual distance Towards the use of Machine Learning
  • 67. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 67/79 Enabling technologies: prioritisation ● Importance of determining language similarities and dissimilarities: – re-use of existing materials in different communities ● beware of problems (e.g. Ibibio / Efik) – adaptation of existing productivity technologies – prioritisation of documentation funding – ...
  • 68. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 68/79 How similar are languages? - A little similar, but not very similar
  • 69. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 69/79 Selecting features for similarity distances ● Lexical – Swadesh word list – West African Lexical Dictionary Set (WALDS) – … ● Phonetic / phonological – vowel set comparison – consonant set comparison (more stable – used for many traditional initial classifications, e.g. of Indo-European languages) ● Grammatical features – World Atlas of Language Structures (WALS)
  • 70. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 70/79 Kru languages – Ivory Coast: Method – consonant systems
  • 71. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 71/79 Kru languages – Ivory Coast: Method – consonant systems
  • 72. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 72/79 Kru languages – Ivory Coast: Feature – consonant systems Method: 1. compare all consonant systems pairwise: Levenshtein Edit Distance / Hamming Distance 2. visualise differences as ‘virtual distances’ in a chart
  • 73. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 73/79 Kru languages – Ivory Coast: Feature – consonant systems
  • 74. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 74/79 Kru languages – Ivory Coast: Feature – consonant systems
  • 75. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 75/79 Which features are most useful? - Help from Machine Learning (Decision Tree Induction)
  • 76. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 76/79 Conclusion What do we lose if we lose languages?
  • 77. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 77/79 What do we lose if we lose languages? ● Why is this an issue? What danger?! – Imagine what we lose if a language disappears... – Why not just use one language worldwide? – Remember what language is for! ● Imagine what we lose if a language disappears: – Structure: understanding of the architecture of the human mind – each language has a unique combination of cognitive patterns – Language semantics: a language is a repository of knowledge about the world – skills, health – Pragmatics: narratives of history and identity of a language community, law, literature (with music and art), religion, ... ● Why not just use one language worldwide? – Lack of interregional communication means that the shared language will develop local differences anyway ... – It’s boring ... –
  • 78. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 78/79 Language is complex: Ranks, Interpretations, Languages, Context (MORPHO)PHONEME MORPHEME LEXICAL ROOT DERIVED WORD COMPOUND WORD PHRASE CLAUSE SENTENCE TEXT LEXICON–partialregularity,holisticopacity DIALOGUE Syntagmatic properties Hypostatic properties in different modalities: speech text gesture Paradigmatic properties Grammar – compositionality PROSODIC HIERARCHIES structural opacity SEMANTICS/PRAGMATICS CONCEPTS, OBJECTS, EVENTS structural opacity
  • 79. ELKL-4, U Agra, 2016-02-25_27 D. Gibbon: What can endangered languages teach the language technologies? 79/79 Diolch yn fawr!