Language is more than just a means of communication. It influences our culture and even our thought processes. During the first four decades of the 20th century, language was viewed by American linguists and anthropologists as being more important than it actually is in shaping our perception of reality. This was mostly due to Edward Sapir and his student Benjamin Whorf who said that language predetermines what we see in the world around us. In other words, language acts like a polarizing lens on a camera in filtering reality--we see the real world only in the categories of our language.
These slides cover the relationship between language, culture and thought as discussed by Ronald Wardhaugh in "An Introduction to Sociolinguistics". The examples are drawn from the Pakistani context and culture.
ELKL 4, Language Technology: learning from endangered languages (Dafydd Gibbon)
Presentation at the ELKL-4 (4th Endangered and Less Resourced Languages) conference, Agra University, India.
Types of language documentation (data and software tools).
5th International Scientific Conference "Information Science in an Age of Change: Innovative Information Services". Faculty of Journalism, Information and Book Studies, Department of Information Studies (Katedra Informatologii), University of Warsaw, Warsaw, 15–16 May 2017
In the realm of artificial intelligence (AI), speech recognition has emerged as a transformative technology, enabling machines to understand and interpret human speech with remarkable accuracy. At the heart of this technological revolution lies the availability and quality of speech recognition datasets, which serve as the building blocks for training robust and efficient speech recognition models.
A speech recognition dataset is a curated collection of audio recordings paired with their corresponding transcriptions or labels. These datasets are essential for training machine learning models to recognize and comprehend spoken language across various accents, dialects, and environmental conditions. The quality and diversity of these datasets directly impact the performance and generalisation capabilities of speech recognition systems.
The importance of high-quality speech recognition datasets cannot be overstated. They facilitate the development of more accurate and robust speech recognition models by providing ample training data for machine learning algorithms. Moreover, they enable researchers and developers to address challenges such as speaker variability, background noise, and linguistic nuances, thus enhancing the overall performance of speech recognition systems.
One of the key challenges in building speech recognition datasets is the acquisition of diverse and representative audio data. This often involves recording a large number of speakers from different demographic backgrounds, geographic regions, and language proficiency levels. Additionally, the audio recordings must capture a wide range of speaking styles, contexts, and environmental conditions to ensure the robustness and versatility of the dataset.
Another crucial aspect of speech recognition datasets is the accuracy and consistency of the transcriptions or labels. Manual transcription of audio data is a labor-intensive process that requires linguistic expertise and meticulous attention to detail. To ensure the reliability of the dataset, transcriptions must be verified and validated by multiple annotators to minimise errors and inconsistencies.
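One way to make the multi-annotator check measurable is to compare two transcriptions of the same recording word by word. The sketch below computes a word error rate from the word-level edit distance, treating one annotator's transcription as the reference. The function name and the example sentences are illustrative assumptions, not part of any dataset or toolkit named here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two transcriptions,
    divided by the number of words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must not be empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcriptions of one recording by two annotators.
annotator_a = "the cat sat on the mat"
annotator_b = "the cat sat on a mat"
disagreement = word_error_rate(annotator_a, annotator_b)
```

A recording whose disagreement exceeds a project-chosen threshold could then be sent to a third annotator; the threshold itself is a design decision that the text does not specify.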
The availability of open-source speech recognition datasets has played a significant role in advancing research and innovation in the field of AI speech technology. Projects such as the LibriSpeech dataset, CommonVoice dataset, and Google's Speech Commands dataset have provided researchers and developers with access to large-scale, annotated audio datasets, fostering collaboration and accelerating progress in speech recognition research.
Furthermore, initiatives aimed at crowdsourcing speech data, such as Mozilla's Common Voice project, have democratised the process of dataset creation by enabling volunteers from around the world to contribute their voice recordings. This approach not only helps to diversify the dataset but also empowers individuals to participate in the development of AI technologies that directly impact their lives.
Industrial Natural Language Processing & Information Extraction: a research area of the chair for technologies and management of digital transformation from the university of Wuppertal, Germany.
For more information, see here: https://www.tmdt.uni-wuppertal.de/de
1 CELI – Language and Information, January 2014
2 We develop software solutions based on (NLP) Natural Language Processing
3 CELI’s offices, Countries in which we operate, Years of experience, People, Active customers, Business lines
4 Partners in Academia, Research projects, Published scientific papers
Close relationship with scientific community
5 From 1999 to 2013
6 Clients: semantic solutions, Speech Technology, Blogmeter
7 NLP solutions
8 NLP technology: Comprehensive suite of multilingual components and resources
9 Linguistic processing and annotation
10 From text to Knowledge
11 Meaningful intelligence from unstructured information
12 Speech technology: Comprehensive suite of multilingual components and resources for text processing in Voice application (Text To Speech)
13 Contribution to TTS development: Consulting and technologies
14 Semantic solutions
15 Semantic Search: Enterprise Semantic Search solution for document system and knowledge management systems
16 Linked Data for Semantic Search: Creation-ReUse of multilingual ontologies, Linking to LOD resources, Deploying LOD
17 Linked (Open) Data for Enterprise Search
18 Semantic Search Platform
19 Customer Voice Analytics: Automatic classification of customer surveys (answers to open questions) and verbatim (customer cases or call transcriptions)
20-21 Multilingual management of verbatim coding
22 Product lines (Blogmeter, Crosslibrary)
23 Social Media Monitoring, Analytics & Management Tools for Companies & Agencies.
24 Blogmeter: Leader in Italy in social media intelligence, cutting-edge technologies for social intelligence
25 Digital Humanities and Digital School
26 Reading the classics using digital tools
27 I Promessi sposi and Pinocchio
28 Thank you for your attention!
29 Vittorio Di Tomaso ditomaso@celi.it
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing. (Lifeng (Aaron) Han)
Invited Presentation in NLP lab of Soochow University, about my NLP journey and ADAPT Centre. NLP part covers Machine Translation Evaluation, Quality Estimation, Multiword Expression Identification, Named Entity Recognition, Word Segmentation, Treebanks, Parsing.
🚀 *Unlock Your Potential in the Tech World! Explore Your Career Path Today!* 🚀
Are you ready to dive into the exciting realm of technology and shape your career in cutting-edge domains? 🌐📱💻 Whether you're a budding enthusiast or an experienced professional, there's a world of opportunities waiting for you in the fields of Android & Web Development, AI/ML, Cybersecurity, Data Science, PR & Marketing, Designing, Programming Languages and Data Structures.
🔹 *Android & Web Development*: Build the digital future by creating user-friendly apps and responsive websites.
🔹 *AI/ML Enthusiasts*: Join the revolution of Artificial Intelligence and Machine Learning, making computers smarter and more capable of human-like tasks.
🔹 *Cybersecurity Guardians*: Protect digital landscapes from evolving threats, safeguarding sensitive information and ensuring the integrity of systems.
🔹 *Data Science Pioneers*: Dive into data-driven insights, unravel patterns, and make strategic decisions that shape industries and innovations.
🔹 *PR & Marketing Maestros*: Craft compelling narratives, shape brand identities, and influence trends in the fast-paced world of tech communication.
🔹 *Creative Designers*: Fuse technology with artistry; create visually stunning interfaces, logos, and graphics that leave a lasting impact.
🔹 *Coding Champions*: Master programming languages and data structures to engineer solutions that solve real-world challenges.
🔹 *Cloud Computing* Innovators: Harness the power of the cloud, revolutionize accessibility, and drive seamless digital transformation.
Embark on a journey of continuous learning and growth with resources such as online courses, workshops, webinars, and mentorship programs. Your passion, combined with the right knowledge, can lead to a fulfilling career in these dynamic domains. 🌟
Ready to take the next step?
Phenomics assisted breeding in crop improvement (IshaGoswami9)
The world population is increasing and is expected to reach about 9 billion by 2050, and climate change makes it harder to meet the food requirements of such a large population. Facing the challenges presented by resource shortages, climate change, and increasing global population, crop yield and quality need to be improved in a sustainable way over the coming decades. Genetic improvement by breeding is the best way to increase crop productivity. With the rapid progression of functional genomics, an increasing number of crop genomes have been sequenced and dozens of genes influencing key agronomic traits have been identified. However, current genome sequence information has not been adequately exploited for understanding the complex characteristics of multiple genes, owing to a lack of crop phenotypic data. Efficient, automatic, and accurate technologies and platforms that can capture phenotypic data that can be linked to genomics information for crop improvement at all growth stages have become as important as genotyping. Thus, high-throughput phenotyping has become the major bottleneck restricting crop breeding. Plant phenomics has been defined as the high-throughput, accurate acquisition and analysis of multi-dimensional phenotypes during crop growing stages at the organism level, including the cell, tissue, organ, individual plant, plot, and field levels. With the rapid development of novel sensors, imaging technology, and analysis methods, numerous infrastructure platforms have been developed for phenotyping.
This presentation explores a brief idea about the structural and functional attributes of nucleotides, the structure and function of genetic materials along with the impact of UV rays and pH upon them.
Nucleophilic Addition of carbonyl compounds.pptx (SSR02)
Nucleophilic addition is the most important reaction of carbonyls. Not just aldehydes and ketones, but also carboxylic acid derivatives in general.
Carbonyls undergo addition reactions with a large range of nucleophiles.
Comparing the relative basicity of the nucleophile and the product is extremely helpful in determining how reversible the addition reaction is. Reactions with Grignards and hydrides are irreversible. Reactions with weak bases like halides and carboxylates generally don’t happen.
Electronic effects (inductive effects, electron donation) have a large impact on reactivity.
Large groups adjacent to the carbonyl will slow the rate of reaction.
Neutral nucleophiles can also add to carbonyls, although their additions are generally slower and more reversible. Acid catalysis is sometimes employed to increase the rate of addition.
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt... (Sérgio Sacani)
Since volcanic activity was first discovered on Io from Voyager images in 1979, changes on Io’s surface have been monitored from both spacecraft and ground-based telescopes. Here, we present the highest spatial resolution images of Io ever obtained from a ground-based telescope. These images, acquired by the SHARK-VIS instrument on the Large Binocular Telescope, show evidence of a major resurfacing event on Io’s trailing hemisphere. When compared to the most recent spacecraft images, the SHARK-VIS images show that a plume deposit from a powerful eruption at Pillan Patera has covered part of the long-lived Pele plume deposit. Although this type of resurfacing event may be common on Io, few have been detected due to the rarity of spacecraft visits and the previously low spatial resolution available from Earth-based telescopes. The SHARK-VIS instrument ushers in a new era of high resolution imaging of Io’s surface using adaptive optics at visible wavelengths.
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep... (University of Maribor)
Slides from:
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Track: Artificial Intelligence
https://www.etran.rs/2024/en/home-english/
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V... (Wasswaderrick3)
In this book, we use conservation of energy techniques on a fluid element to derive the Modified Bernoulli equation of flow with viscous or friction effects. We derive the general equation of flow velocity and from it the Poiseuille flow equation, the transition flow equation and the turbulent flow equation. Where there are no viscous effects, the equation reduces to the Bernoulli equation. From experimental results, we are able to include other terms in the Bernoulli equation. We also look at cases where pressure gradients exist. We use the Modified Bernoulli equation to derive equations of flow rate for pipes of different cross-sectional areas connected together. We also extend our techniques of energy conservation to a sphere falling in a viscous medium under the effect of gravity. We demonstrate the Stokes equation of terminal velocity and the turbulent flow equation. We look at a way of calculating the time taken for a body to fall in a viscous medium. We also look at the general equation of terminal velocity.
The Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdf (mediapraxi)
The rise of virtual labs has been a key tool in universities and schools, enhancing active learning and student engagement.
💥 Let’s dive into the future of science and shed light on PraxiLabs’ crucial role in transforming this field!
Toxic effects of heavy metals: Lead and Arsenic (sanjana502982)
Heavy metals are naturally occurring metallic chemical elements that have relatively high density and are toxic even at low concentrations. All toxic metals are termed heavy metals irrespective of their atomic mass and density, e.g. arsenic, lead, mercury, cadmium, thallium, chromium, etc.
Professional air quality monitoring systems provide immediate, on-site data for analysis, compliance, and decision-making.
Monitor common gases, weather parameters, particulates.
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx (MAGOTI ERNEST)
Although Artemia has been known to man for centuries, its use as a food for the culture of larval organisms apparently began only in the 1930s, when several investigators found that it made an excellent food for newly hatched fish larvae (Litvinenko et al., 2023). As aquaculture developed in the 1960s and ‘70s, the use of Artemia also became more widespread, due both to its convenience and to its nutritional value for larval organisms (Arenas-Pardo et al., 2024). The fact that Artemia dormant cysts can be stored for long periods in cans, and then used as an off-the-shelf food requiring only 24 h of incubation makes them the most convenient, least labor-intensive, live food available for aquaculture (Sorgeloos & Roubach, 2021). The nutritional value of Artemia, especially for marine organisms, is not constant, but varies both geographically and temporally. During the last decade, however, both the causes of Artemia nutritional variability and methods to improve poor-quality Artemia have been identified (Loufi et al., 2024).
Brine shrimp (Artemia spp.) are used in marine aquaculture worldwide. Annually, more than 2,000 metric tons of dry cysts are used for cultivation of fish, crustacean, and shellfish larva. Brine shrimp are important to aquaculture because newly hatched brine shrimp nauplii (larvae) provide a food source for many fish fry (Mozanzadeh et al., 2021). Culture and harvesting of brine shrimp eggs represents another aspect of the aquaculture industry. Nauplii and metanauplii of Artemia, commonly known as brine shrimp, play a crucial role in aquaculture due to their nutritional value and suitability as live feed for many aquatic species, particularly in larval stages (Sorgeloos & Roubach, 2021).
BREEDING METHODS FOR DISEASE RESISTANCE.pptx (RASHMI M G)
Plant breeding for disease resistance is a strategy to reduce crop losses caused by disease. Plants have an innate immune system that allows them to recognize pathogens and provide resistance. However, breeding for long-lasting resistance often involves combining multiple resistance genes.
The ability to recreate computational results with minimal effort and actionable metrics provides a solid foundation for scientific research and software development. When people can replicate an analysis at the touch of a button using open-source software, open data, and methods to assess and compare proposals, it significantly eases verification of results, engagement with a diverse range of contributors, and progress. However, we have yet to fully achieve this; there are still many sociotechnical frictions.
Inspired by David Donoho's vision, this talk aims to revisit the three crucial pillars of frictionless reproducibility (data sharing, code sharing, and competitive challenges) with the perspective of deep software variability.
Our observation is that multiple layers — hardware, operating systems, third-party libraries, software versions, input data, compile-time options, and parameters — are subject to variability that exacerbates frictions but is also essential for achieving robust, generalizable results and fostering innovation. I will first review the literature, providing evidence of how the complex variability interactions across these layers affect qualitative and quantitative software properties, thereby complicating the reproduction and replication of scientific studies in various fields.
I will then present some software engineering and AI techniques that can support the strategic exploration of variability spaces. These include the use of abstractions and models (e.g., feature models), sampling strategies (e.g., uniform, random), cost-effective measurements (e.g., incremental build of software configurations), and dimensionality reduction methods (e.g., transfer learning, feature selection, software debloating).
I will finally argue that deep variability is both the problem and solution of frictionless reproducibility, calling the software science community to develop new methods and tools to manage variability and foster reproducibility in software systems.
Invited talk, Journées Nationales du GDR GPL 2024
Deep Software Variability and Frictionless Reproducibility
ELKL 5 Language documentation for linguistics and technology
1. Language Documentation
for Linguistics and Technology
Or: What can we do with our documentation?
Dafydd Gibbon
U Bielefeld
ELKL-5, Ranchi, Jharkhand, India, 2017-02-24
2. 2017-02-24 Language Documentation for Linguistics and Technology 2/58
Focus: role reversal
Not just:
What can the Human Language Technologies
offer to endangered and less resourced languages?
But:
What can Human Language Engineering learn from endangered
and less resourced languages?
And:
What does documentation of endangered and less resourced
languages require from the Human Language Technologies and
their data and tool resources?
3. Roles for computational technologies in language documentation / language resources
1. Documentation technologies
2. Enabling technologies
3. Productivity technologies
There is a rapidly growing number of language learning and documentation apps for many languages on
the various smartphone and tablet app stores – and some kinds are easy to make!
In Africa: from Amharic and Bambara to Yoruba and Zulu
Specifically in Nigeria: Yoruba, Hausa, Igbo, Ibibio, ...
In India there is a huge amount of relevant work going on with national, regional and local languages.
4. (1) Documentation Technologies
Project planning tools
Data collection tools
● scenario support for
– elicitation
– recording (multimodal) and annotating
– metadata collection
● document scanning, OCRing and annotating
Data archiving and access
● standardized database and search models
– relational, object-oriented, ...
Multilinear annotation
● for search, re-use analysis, application:
– sharable (sustainable, interoperable) standards
– annotation categories for phonetics, grammar, discourse, …
– semi-automatic annotation methods
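Multilinear annotation of the kind listed above can be pictured as a set of parallel tiers, each holding time-aligned, labelled intervals. The following is a minimal sketch of that data structure; the class names, tier names and timings are illustrative assumptions and do not reproduce the file format of any particular annotation tool.

```python
from dataclasses import dataclass, field


@dataclass
class Interval:
    start: float  # seconds from the beginning of the recording
    end: float
    label: str


@dataclass
class Tier:
    name: str
    intervals: list = field(default_factory=list)

    def add(self, start, end, label):
        """Append a labelled interval; end must lie after start."""
        if end <= start:
            raise ValueError("interval end must be greater than start")
        self.intervals.append(Interval(start, end, label))

    def overlapping(self, start, end):
        """Return the intervals that share some time with [start, end)."""
        return [iv for iv in self.intervals if iv.start < end and iv.end > start]


def aligned_labels(source, target):
    """Pair each source label with the target labels that overlap it in time."""
    return [
        (iv.label, [t.label for t in target.overlapping(iv.start, iv.end)])
        for iv in source.intervals
    ]


# Two hypothetical tiers over the same recording: words and part-of-speech categories.
words = Tier("words")
words.add(0.0, 0.4, "good")
words.add(0.4, 0.9, "morning")

pos = Tier("pos")
pos.add(0.0, 0.4, "ADJ")
pos.add(0.4, 0.9, "NOUN")

pairs = aligned_labels(words, pos)
```

Keeping every tier on the same time axis is what makes the annotation searchable and reusable: a query on one tier (for example, all nouns) can be mapped to the corresponding stretch of any other tier.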
5. (2) Enabling Technologies
Resource construction tools
● phonetic analysis
● lexicon induction from data
– word lists
– word frequency lists (and other word statistics)
– concordances
– collocations
● grammar induction from data
– Part of Speech (POS) tagging
– grammar induction
– parsing and generation
● translation
– multilingual dictionaries
– terminologies
– processing of parallel or comparable texts
– translator’s workbench
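The lexicon induction items above (word lists, word frequency lists, concordances, collocations) can all be derived from a plain text corpus with very little code. The sketch below uses only the Python standard library; the sample sentence, the function names and the use of adjacent word pairs as a crude stand-in for collocations are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def concordance(tokens, keyword, width=2):
    """Return keyword-in-context lines with up to `width` tokens on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


def collocations(tokens):
    """Count adjacent word pairs as a simple collocation measure."""
    return Counter(zip(tokens, tokens[1:]))


sample = "the dog saw the cat and the cat saw the dog"
tokens = tokenize(sample)
word_list = sorted(set(tokens))
```

For real documentation data the tokenizer would need care: the `\w+` pattern treats combining diacritics (such as tone marks written as separate code points) as word boundaries, so orthographies that rely on them would need normalisation or a custom pattern first.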
6. (3) Product Technologies
Recognition techniques
– Automatic Speech Recognition (ASR)
– Visual scene and object recognition
– Information retrieval from text
Identification techniques
– Speaker identification
– Language identification
– Authorship attribution
Generation techniques
– Text-to-Speech Synthesis (TTS)
– Written text generation from databases
Products
– Dictation and information applications
– Translation applications
7. Some of the language technologies
Engineering:
– speech: ASR, TTS, speaker id / recognition, ...
– language: Natural Language Processing (NLP), NL parsing,
Q&A, text mining, text classification, lexicon and grammar
induction, machine translation …
– multimodal: speech I/O (dictation, process control, speech
computer UI), speech avatars (Siri, Cortana), gesture
(touchpad, waving), biometric systems
Computational linguistics & Computer Science:
– domain models of natural language syntax, semantics,
pragmatics, language typology and genesis
– formalisms and algorithms for induction, parsing,
generation of language
– corpus analysis for lexicon and grammar induction
9. Preliminary definitions
The terminology is specific to disciplines:
– In linguistics: language documentation
– In the language technologies: language resources
Preliminary definitions:
– documentation as result, outcome or product
● Corpora of text inscriptions and speech recordings, plus metadata
and basic description (transcriptions, translations and annotations of
basic categories and spatial or temporal structure).
– documentation as activity, workflow or process
● Use of standardised tools, research methodology, data formats,
testing procedures for creating documentation products
The product model is useful because it suggests that there
are uses and users of documentation for a purpose.
10. Quotes on “language resources”
Resources in speech technology: “appropriate
infrastructure in terms of standardised tools, research
methodology, data formats, testing procedures”
Gibbon, Dafydd, Roger Moore, Richard Winski, eds. (1997). Handbook of Standards
and Resources for Spoken Language Systems. Berlin: Mouton de Gruyter
http://wwwhomes.uni-bielefeld.de/gibbon/Handbooks/gibbon_handbook_1997/
“Language resources are the collective materials used by
those engaged in language-related education, research
and technology development. Spanning data collections,
corpora, software, research papers and specifications,
these vital tools aid and inspire scientific progress.”
Linguistic Data Consortium
https://www.ldc.upenn.edu/language-resources
11. Quotes on “language documentation”
Language documentation (also known by the term ‘documentary
linguistics’) is the subfield of linguistics that is ‘concerned with the
methods, tools, and theoretical underpinnings for compiling a
representative and lasting multipurpose record of a natural
language or one of its varieties’ (Himmelmann 2006:v)
A similar definition is given by Woodbury (2010) as ‘the creation,
annotation, preservation, and dissemination of transparent records
of a language’.
Language documentation is by its nature multidisciplinary, and as
Woodbury (2010) notes, it draws on ‘concepts and techniques from
linguistics, ethnography, psychology, computer science, recording
arts, and more’ (see Harrison 2005, Coelho 2005, Eisenbeiss 2005
for examples).
Peter K. Austin (2010). Current issues in language documentation.
In Peter K. Austin (ed.) Language Documentation and Description, vol 7. London: SOAS. pp. 12-33
12. Language documentation / resources are required for ...
In the language technologies, statistical methods are dominant
so huge data resources are needed (e.g. billions of words daily):
Search engines (cf. Google, Bing, Baidu, ... )
100% of major players are trained and probabilistic. Their operation cannot be
described by a simple function.
Speech recognition (cf. Siri, Cortana, Alexa, Google)
100% of major systems are trained and probabilistic, mostly relying on
probabilistic hidden Markov models.
Machine translation (cf. Google, Bing, Baidu)
100% of top competitors in competitions such as NIST use statistical methods.
Some commercial systems use a hybrid of trained and rule-based approaches.
Of the 4000 language pairs covered by machine translation systems, a
statistical system is by far the best for every pair except Japanese-English,
where the top statistical system is roughly equal to the top hybrid system.
Question answering (cf. Siri, Cortana, Alexa, Google)
this application is less well-developed, and many systems build heavily on the
statistical and probabilistic approach used by search engines. The IBM Watson
system that recently won on Jeopardy is thoroughly probabilistic and trained,
while Boris Katz's START is a hybrid. All systems use at least some statistical
techniques.
Source: Peter Norvig, norvig.com
13. From documentation-as-a-product to products
Bing MT
Google MT
Speech recognition
Speech synthesis (e.g. screen readers)
Information Systems
All are heavy ‘big data’ resource users
14. From documentation-as-a-product to products
Bing MT
Google MT
Speech recognition
Speech synthesis
Information Systems: heavy data resource users
Up to you to evaluate!
16. From documentation-as-a-product to documentation products
The meeting of the disciplines: Aikuma, Simputer, Text-to-Speech
● Steve Bird and his Android fieldwork app “Aikuma”
● The Hyderabad Simputer CDA
● And many, many more smartphone apps, like Google TTS, for many languages!
17. Generic tools for documentation-as-a-process
● Praat, by Paul Boersma & David Weenink (initially a speech technology resource)
● SPPAS, by Brigitte Bigi, Aix-en-Provence
● ELAN, MPI Nijmegen
● Toolbox, SIL
19. Purpose-designed tools for documentation-as-a-process - text
Linguistics, particularly computational linguistics & Natural
Language Processing (NLP), with applications in language and
speech technologies:
● Word sense disambiguation:
100% of top competitors at the SemEval-2 competition used statistical
techniques; most are probabilistic; some use a hybrid approach incorporating
rules from sources such as Wordnet.
● Coreference resolution:
The majority of current systems are statistical, although we should mention the
system of Haghighi and Klein, which can be described as a hybrid system that
is mostly rule-based rather than trained, and performs on par with top
statistical systems.
● Part of speech tagging:
Most current systems are statistical. The Brill tagger stands out as a
successful hybrid system: it learns a set of deterministic rules from statistical
data.
● Parsing:
There are many parsing systems, using multiple approaches. Almost all of the
most successful are statistical, and the majority are probabilistic (with a
substantial minority of deterministic parsers).
Source: Peter Norvig, norvig.com
20. Purpose-designed tools for documentation-as-a-process - speech
PV – Prosody Visualizer
22. Comparison of DocLing and LangTech scenarios
● The documentary linguistic scenarios are:
– rather individual, extremely heterogeneous
– rather hard to define and delimit
– de facto standards: Praat, ELAN, Wordsmith, TypeCraft,...
– somewhat ad hoc - ‘what you can get’
● Language, speech, multimodal technology scenarios are:
– highly standardised, rather coherent
– tendentially easy to define and delimit
– very application / product oriented
● especially in speech technology: highly product specific
● text technology is more generic
– regulated standards:
● statistical evaluation procedures
● institutional standards (e.g. ISO)
23. Documentary linguistics vs. language technologies
Documentary linguistics
Language Documentation
● texts, audio, video corpora
● dictionaries
● language structure
● language context
Motivation
● heritage preservation
● education
● linguistic insight
● ethics
Methods
● data collection
● categorial description
● tools
Language technologies
Language resources
● text, audio, video corpora
● dictionaries, wordnets
● language models
● language scenarios
Objectives
● system development
● software applications
● new algorithms
● marketing
Methods
● data collection
● statistical modelling
● tools
25. Motivation (documentary linguistics) vs. objectives (language technologies)
Documentary linguistics: motivation
Maintenance, revitalisation
– spoken language
– text
– culture
Social payback
– language teaching
– immersive teaching
– health, marketing
Linguistic insight
– language classification
– language typology
– language and cognition
Ethics
– identity
– taboo
– human rights
Language technologies: objectives
System development
– speech, spoken language
– text
– multimodal
Software products
– speech recognition
– speech synthesis
– speaker recognition
Efficient algorithms
– Hidden Markov Models
– Neural Networks
– Machine Learning
Marketing
– branding
– customer satisfaction
– consumer regulations
26. 2017-02-24 Language Documentation for Linguistics and Technology 26/58
Methods
Documentary linguistics:
Data collection
– interview
– fieldwork
Data description
– manual annotation
– dictionary
– sketch grammar
Tools
– annotation tools
– databases, repositories
– formats
Language technologies:
Data collection
– experiment
– fieldwork
Data modelling
– automatic annotation
– dictionary
– speech & language models
Tools
– custom annotation tools
– databases, repositories
– formats
27. 2017-02-24 Language Documentation for Linguistics and Technology 27/58
Methods
Available tools
– annotation tools
● Praat, ELAN, ...
● Praat, Perl, shell scripting
– databases, repositories
● Toolbox; MPI, SOAS, ...
– formats
● ad hoc
● XML, Unicode
● IPA
● novel orthography
Standardized tools
– custom annotation tools
● (semi-)automatic
● HMMs, Deep Learning
– databases, repositories
● custom; LDC, ...
– formats
● custom
● XML, Unicode
● SAMPA (IPA)
● standard orthography
28. 2017-02-24 Language Documentation for Linguistics and Technology 28/58
Applying technologies – a reminder
A traditional development, feedback and revision cycle:
● Requirements specification: systems analysis
● Design: modules, algorithms, data structures
● Implementation: programming methods, styles
● Evaluation: black box, glass box, field testing
● Product dissemination: reliability, sustainability, interoperability, updating
29. 2017-02-24 Language Documentation for Linguistics and Technology 29/58
Endangered languages as teachers: some outlets
● Speech Assessment Methodologies (EC Project)
● EAGLES: Expert Advisory Groups for Language
Engineering Standards
● Many other resource-oriented European projects, including
– MATE
– IMDI
– ...
● LREC – Language Resources and Evaluation Conference
● Language Resources Map
– https://en.wikipedia.org/wiki/LRE_Map
● Krauwer’s BLARK: Basic Language Resource Kit
30. 2017-02-24 Language Documentation for Linguistics and Technology 30/58
Endangered languages as teachers: some models
● Steven Krauwer’s BLARK:
– Goal of equal status of European languages
– Generalisable to the world at large?
– Basic Language Resource Kit initial specification
● written language corpora
● spoken language corpora
● mono- and bilingual dictionaries
● terminology collections
● grammars
● modules (e.g. taggers, morphological analysers, parsers, speech
recognisers, text-to-speech)
● annotation standards and tools
● corpus exploration and exploitation tools
● bilingual corpora
● etc
33. 2017-02-24 Language Documentation for Linguistics and Technology 33/58
Documentation and Description: a Scale of Abstraction
For example:
Lexicography
34. 2017-02-24 Language Documentation for Linguistics and Technology 34/58
An integrative model is needed: Rank-Interpretation Architecture
[Diagram labels]
Ranks: (MORPHO)PHONEME, MORPHEME, LEXICAL ROOT, DERIVED WORD, COMPOUND WORD, PHRASE, CLAUSE, SENTENCE, TEXT, DIALOGUE
LEXICON – partial regularity, holistic opacity
Grammar – compositionality
Syntagmatic properties
Paradigmatic properties
Hypostatic properties in different modalities: speech, text, gesture
PROSODIC HIERARCHIES – structural opacity
SEMANTICS/PRAGMATICS: CONCEPTS, OBJECTS, EVENTS – structural opacity
36. 2017-02-24 Language Documentation for Linguistics and Technology 36/58
Doc Ling and Digital Humanities – a personal view of history
[Diagram labels]
Humanities: Theology, Philology, Literary Studies, Linguistics, Law, ...
Computation
Computational Linguistics
Mathematical Computational Linguistics
NLP, Computational Corpus Linguistics
Digital Humanities
Speech Technologies
Documentary Linguistics ?
Timeline: 1950
Humanities Computing
37. 2017-02-24 Language Documentation for Linguistics and Technology 37/58
More on Enabling Technologies
Annotation and Annotation Mining
Language Similarity Analysis
Data Capture and Storage
38. 2017-02-24 Language Documentation for Linguistics and Technology 38/58
Enabling technologies
Annotation, preferably (semi-)automatic
– associating text labels and time-stamps with speech recordings
– annotation mining (preferably automatic)
– information extraction from annotations:
● text label list, text label frequencies, text label duration statistics
● visualisation of text label duration patterns, rhythm patterns
Classification, similarity analysis
– e.g. virtual distance mapping
● Which languages have been (almost) documented and can be easily
related to already documented languages?
● Geographic (areal contact)
● Typological (paradigmatic and syntagmatic structural similarity)
● Genealogical (history of language families)
Quasi-commercial applications
– e.g. text-to-speech synthesis for automatic indigenous
information services
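The annotation mining step on this slide (label list, label frequencies, label duration statistics) can be sketched roughly as follows. This is a minimal illustration only: the tier data, the `(label, start, end)` tuple layout and the function name `mine_annotations` are assumptions made for the example, not the format of any particular annotation tool.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev

# Hypothetical annotation tier: (text label, start time in s, end time in s).
annotations = [
    ("low", 0.00, 0.25),
    ("zong", 0.25, 0.50),
    ("low", 0.50, 1.00),
    ("len", 1.00, 1.25),
]

def mine_annotations(intervals):
    """Return label frequencies and per-label duration statistics."""
    durations = defaultdict(list)
    for label, start, end in intervals:
        durations[label].append(end - start)
    frequencies = Counter(label for label, _, _ in intervals)
    stats = {
        label: {
            "n": len(values),
            "mean": mean(values),
            "sd": pstdev(values),  # population standard deviation
            "min": min(values),
            "max": max(values),
        }
        for label, values in durations.items()
    }
    return frequencies, stats

frequencies, stats = mine_annotations(annotations)
label_list = sorted(frequencies)  # the text label list
```

The resulting duration statistics are the kind of input a visualisation of duration or rhythm patterns would start from.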
39. 2017-02-24 Language Documentation for Linguistics and Technology 39/58
The case of Annotation
and Annotation Mining
for linguistics and technology
40. 2017-02-24 Language Documentation for Linguistics and Technology 40/58
Why annotation? And how?
The primary reason for the annotation of text and speech
data is to enable efficient search procedures
● to provide structure
● in order to make data systematically searchable
● by assigning perceptual/hermeneutic categories to data
● for the purposes of
– finding archived media
– linguistic and phonetic analysis
● development of speech and language systems
And searching unstructured data is difficult
but improving with the help of machine learning
● example: search on free text data
● Google, Bing search (as on-the-fly concordance construction
from web data)
● example: Google’s image search
41. 2017-02-24 Language Documentation for Linguistics and Technology 41/58
Annotation Mining: tone automaton induction for TTS
[Figure labels: startup effect; downstepped H allotone; upstepped L allotone]
42. 2017-02-24 Language Documentation for Linguistics and Technology 42/58
Tone: Kuki-Thadou
Thadou tones:
lów (H) ‘field’,
lǒw (LH) ‘medicine’,
lòw (L) ‘negative marker’.
Tones in isolation: LH zŏng ‘monkey’, L lèn ‘big’
Tone sequence: L+H zòng lén ‘big monkey’
Note H tone shift and L deletion.
43. 2017-02-24 Language Documentation for Linguistics and Technology 43/58
Tone: Kuki-Thadou
Thadou tones:
lów (H) ‘field’,
lǒw (LH) ‘medicine’,
lòw (L) ‘negative marker’.
Tone  N         min        max        mean  sd    offset  slope
H     18 (864)  200 (220)  244 (222)  221   0.29  221     -0.03
LH    17 (816)  215 (198)  237 (268)  220   7.07  209     1.3
L     18 (864)  192 (178)  213 (227)  203   6.3   215     -1.31
The descriptive statistics are over
averages of 16 pitch samples for each of
3 occurrences of each vowel with which
each tone is associated (e.g.
864 = 18 x 16 x 3).
The values over all measurement sets
per tone are in parentheses.
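As a rough sketch of how such per-tone figures could be computed from a pitch track, the following assumes that "offset" and "slope" are the intercept and slope of a least-squares line fitted to the pitch samples against sample index. The slides do not state how these columns were obtained, so this is an assumption, and the function name `describe_pitch` and the sample values are hypothetical.

```python
from statistics import mean, stdev

def describe_pitch(samples):
    """Summarise a pitch track (Hz) and fit a least-squares line to it.

    Assumption: offset = intercept, slope = change in Hz per sample.
    Requires at least two samples.
    """
    xs = list(range(len(samples)))
    x_bar = mean(xs)
    y_bar = mean(samples)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, samples))
    slope = sxy / sxx
    offset = y_bar - slope * x_bar
    return {
        "n": len(samples),
        "min": min(samples),
        "max": max(samples),
        "mean": y_bar,
        "sd": stdev(samples),  # sample standard deviation
        "offset": offset,
        "slope": slope,
    }

# Hypothetical rising pitch track, in Hz.
summary = describe_pitch([200.0, 202.0, 204.0, 206.0])
```

On this reading, a positive slope would correspond to a rising contour and a negative slope to a falling one.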
44. 2017-02-24 Language Documentation for Linguistics and Technology 44/58
The case of language classification:
Documentation beyond individual languages
Language similarity and difference as virtual distance
Aspects of Machine Learning for Language Documentation
45. 2017-02-24 Language Documentation for Linguistics and Technology 45/58
Documentation of languages and language varieties
As for individual languages:
insights into and results concerning
● diversity vs. normalization of speech and language
● the intricacy, complexity of speech, language and languages
● similarities and differences between languages
(typology, history, dispersion of language)
● scenario-dependent properties of languages
● gender, age, social role
● task orientation
● public vs. informal vs. intimate styles
● diversity of expression of emotion
● political status of languages wrt dominance, minorities
46. 2017-02-24 Language Documentation for Linguistics and Technology 46/58
Comparative studies
Language similarity
● for priority setting in language selection
– for funding (usually: chance)
– for adaptation from documentation of existing languages
● for
– language history
– language typology
Language typology:
● structural:
– similarity/difference in speech sound systems
– similarity/difference in grammar
– similarity/difference in the lexicon
● functional
– similarity/difference in discourse conventions
– similarity/difference in general cultural conventions
47. 2017-02-24 Language Documentation for Linguistics and Technology 47/58
How similar are languages? - A little similar, but not very similar
https://elms.wordpress.com/2008/03/04/lexical-distance-among-languages-of-europe/
48. 2017-02-24 Language Documentation for Linguistics and Technology 48/58
Selecting features for similarity distances
● Lexical
– Swadesh word list
– West African Lexical Dictionary Set (WALDS)
– …
● Phonetic / phonological
– vowel set comparison
– consonant set comparison (more stable – used for many
traditional initial classifications, e.g. of Indo-European
languages)
● Grammatical features
– World Atlas of Language Structures (WALS)
● An example:
– Consonants in the Kru language family
51. 2017-02-24 Language Documentation for Linguistics and Technology 51/58
Kru languages – Ivory Coast: Feature – consonant systems
Method:
1. compare all consonant systems pairwise:
Levenshtein Edit Distance / Hamming Distance
2. visualise differences as ‘virtual distances’ in a chart
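Step 1 of this method can be sketched as below. This is an illustration under stated assumptions: the inventories are invented placeholders (not real Kru data), Hamming distance is taken over presence/absence of each consonant in the pooled inventory, and Levenshtein distance is taken over the alphabetically sorted inventories. The slides do not specify the exact encoding used.

```python
from itertools import combinations

# Hypothetical consonant inventories, for illustration only.
inventories = {
    "LangA": {"p", "b", "t", "d", "k", "g", "m", "n"},
    "LangB": {"p", "b", "t", "d", "k", "kp", "m", "n"},
    "LangC": {"p", "t", "k", "kp", "gb", "m"},
}

def hamming(inv_a, inv_b, universe):
    """Count consonants present in exactly one of the two inventories."""
    return sum((c in inv_a) != (c in inv_b) for c in universe)

def levenshtein(seq_a, seq_b):
    """Edit distance with unit cost for insertion, deletion, substitution."""
    prev = list(range(len(seq_b) + 1))
    for i, a in enumerate(seq_a, start=1):
        curr = [i]
        for j, b in enumerate(seq_b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (a != b)))  # substitution
        prev = curr
    return prev[-1]

# Pooled inventory: the feature space for the presence/absence vectors.
universe = sorted(set().union(*inventories.values()))

# Compare all consonant systems pairwise.
distances = {}
for a, b in combinations(sorted(inventories), 2):
    distances[(a, b)] = {
        "hamming": hamming(inventories[a], inventories[b], universe),
        "levenshtein": levenshtein(sorted(inventories[a]), sorted(inventories[b])),
    }
```

The resulting pairwise table is what step 2 would lay out as 'virtual distances' in a chart.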
54. 2017-02-24 Language Documentation for Linguistics and Technology 54/58
Which features are most useful? - Help from Machine Learning
(Decision Tree Induction)
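One way to see how decision tree induction ranks features is the information gain criterion used by ID3-style learners: the feature whose values best separate the classes is chosen first. The sketch below shows only that single ranking step, not a full tree learner. The feature names, group labels and data are hypothetical, and the slides do not say which induction algorithm was used.

```python
from collections import Counter
from math import log2

# Hypothetical data: binary features per language plus a hypothetical group label.
languages = [
    ({"has_kp": 1, "has_gb": 1, "has_v": 0}, "GroupA"),
    ({"has_kp": 1, "has_gb": 1, "has_v": 1}, "GroupA"),
    ({"has_kp": 0, "has_gb": 1, "has_v": 0}, "GroupB"),
    ({"has_kp": 0, "has_gb": 1, "has_v": 1}, "GroupB"),
]

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, feature):
    """Reduction in label entropy after splitting the rows on one feature."""
    labels = [label for _, label in rows]
    remainder = 0.0
    for value in {feats[feature] for feats, _ in rows}:
        subset = [label for feats, label in rows if feats[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

gains = {f: information_gain(languages, f) for f in languages[0][0]}
best_feature = max(gains, key=gains.get)  # feature a tree would split on first
```

In this toy data only `has_kp` separates the two groups, so it receives the highest gain; a feature shared by every language, like `has_gb` here, contributes nothing.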
57. 2017-02-24 Language Documentation for Linguistics and Technology 57/58
Summary
● Reversed roles:
– How can language documentation and the human language
technology resources relate to each other?
● Human Language Technologies:
– Documentary technologies
– Enabling technologies
– Product technologies
● Enabling technologies in language documentation:
– Annotation and annotation mining
● How do languages differ in speech rhythm?
– Language classification assisted by Machine Learning (ML)
● Language differences/distances among languages of Ivory Coast
● Endangered and less resourced languages:
– Enabling technologies amplify efficiency, speed, size