One of the core challenges in typology is to record properties of languages in a structured way. Manual efforts have produced typological knowledge bases, which contain information about languages’ phonological, morphological and syntactic properties, as well as about language families. Ideally, such typological knowledge bases would provide useful information for multilingual NLP models to learn how to selectively share parameters.
A related area of research suggests a different way of encoding properties of languages, namely to learn language representation vectors directly from text documents.
In this talk, I will analyse and contrast these two ways of encoding linguistic properties, as well as present research on how the two can benefit one another.
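To make the contrast concrete, here is a minimal sketch, with invented feature values and embeddings rather than data from any real knowledge base, of the two encodings: discrete typological feature vectors of the kind a WALS-style knowledge base provides, versus dense language vectors learned from text, compared by using embedding similarity to predict a typological property.

```python
import numpy as np

# Hypothetical WALS-style typological features (1 = property present).
# Feature columns: [SVO order, prepositions, definite article].
kb_features = {
    "english":  np.array([1, 1, 1]),
    "french":   np.array([1, 1, 1]),
    "turkish":  np.array([0, 0, 0]),
    "japanese": np.array([0, 0, 0]),
}

# Hypothetical language vectors learned from text, e.g. per-language
# embedding parameters of a multilingual model (values are made up).
learned = {
    "english":  np.array([0.9, 0.1, 0.4]),
    "french":   np.array([0.8, 0.2, 0.5]),
    "turkish":  np.array([-0.7, 0.9, 0.1]),
    "japanese": np.array([-0.8, 0.8, 0.2]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def predict_feature(target, feature_idx):
    """Predict a typological feature for `target` from its nearest
    neighbour in the learned embedding space."""
    neighbours = [lang for lang in learned if lang != target]
    nearest = max(neighbours, key=lambda lang: cosine(learned[target], learned[lang]))
    return nearest, kb_features[nearest][feature_idx]

# Embedding similarity recovers the KB's word-order value for French.
print(predict_feature("french", 0))   # -> ('english', 1)
```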
Neural Network Language Models for Candidate Scoring in Multi-System Machine Translation (Matīss Rikters)
This document summarizes Matīss Rikters' presentation on using neural network language models for candidate scoring in multi-system machine translation. It discusses using character-level recurrent and memory neural networks to score translations from multiple online machine translation systems. The best-performing models were a character-level RNN and a memory network, with the RNN achieving the highest BLEU score of 19.53 on a Latvian-English task. Future work discussed includes expanding the approach to other languages and to tasks such as quality estimation.
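As a rough illustration of the scoring idea, the sketch below substitutes a smoothed character-bigram language model for the character-level RNN and memory networks used in the talk; the corpus and candidate translations are invented.

```python
import math
from collections import Counter

# Toy corpus and add-one smoothed character-bigram LM; both stand in for
# the character-level RNN / memory-network scorers from the talk.
corpus = "the cat sat on the mat . the dog sat on the rug ."
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
V = len(unigrams)

def log_prob(text):
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
        for a, b in zip(text, text[1:])
    )

def perplexity(text):
    return math.exp(-log_prob(text) / max(len(text) - 1, 1))

# Candidate outputs from several hypothetical MT systems for one source
# sentence: the scorer keeps the candidate the LM finds most fluent.
candidates = ["the cat sat on the mat .", "cat the on sat mat the ."]
print(min(candidates, key=perplexity))
```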
The document summarizes the 26th International Conference on Computational Linguistics (COLING) held in Osaka, Japan in December 2016. Over 1100 presenters attended, with 1039 papers submitted and a 32% acceptance rate. Key areas included neural networks, machine translation, dialog systems, and natural language processing applications. Plenary speakers addressed topics such as universal dependencies in parsing and grounded semantics for hybrid machine translation. The conference featured presentations and posters on recent research advances, including character-level named entity recognition, interactive attention for neural machine translation, and improving attention modeling for machine translation.
The document describes four experiments conducted on an LSTM model to analyze its internal dynamics when performing number agreement tasks. Experiment 1 identifies long-range units in the LSTM that encode singular and plural information over long distances. Experiment 2 visualizes the gate and cell state dynamics when handling easy and hard agreement contexts. Experiment 3 looks for short-range units that encode local number information. The experiments find that the LSTM uses both long-range and short-range units sparsely to perform number agreement in a way that mirrors syntactic processing.
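A hedged sketch of the probing methodology behind the long-range-unit search: run minimally different singular and plural contexts through a character-level LSTM and rank hidden units by how strongly their mean activations separate the two. The model here is untrained and tiny, purely to show the mechanics, not to reproduce the paper's trained model or findings.

```python
import torch

torch.manual_seed(0)
# Tiny character-level LSTM, untrained here: the point is only to show
# how per-unit activations can be compared across contexts.
vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ")}
emb = torch.nn.Embedding(len(vocab), 16)
lstm = torch.nn.LSTM(16, 32, batch_first=True)

def unit_trajectory(sentence):
    ids = torch.tensor([[vocab[c] for c in sentence]])
    out, _ = lstm(emb(ids))
    return out.squeeze(0)            # (time, hidden): hidden-state activations

sing = unit_trajectory("the boy near the cars smiles")
plur = unit_trajectory("the boys near the car smile")

# Candidate "number units" are those whose mean activation differs most
# between the singular and plural contexts.
diff = (sing.mean(0) - plur.mean(0)).abs()
print(diff.topk(3).indices.tolist())
```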
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Young Seok Kim)
Review of the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".
ArXiv link: https://arxiv.org/abs/1810.04805
YouTube Presentation: https://youtu.be/GK4IO3qOnLc
(Slides are written in English, but the presentation is done in Korean)
This document discusses using pedagogic corpora in English language teaching. It introduces pedagogic corpora as an alternative to directly transferring corpus linguistics research methods to the classroom. Pedagogic corpora are compiled with thematic relevance and recontextualization for authentication. The document also describes tools for annotating pedagogy in corpora and integrating corpus activities into English language and content-based instruction.
Pedagogical applications of corpus data for English for General and Specific Purposes (Pascual Pérez-Paredes)
FIAL (conference open to researchers and students): "Pedagogical applications of corpus data for English for General and Specific Purposes", Wednesday 4 December, 12:45 (room ERAS 56). UCL, Louvain-la-Neuve.
This document provides an overview of deep learning for information retrieval. It begins with background on the speaker and discusses how the data landscape is changing with increasing amounts of diverse data types. It then introduces neural networks and how deep learning can learn hierarchical representations from data. Key aspects of deep learning that help with natural language processing tasks like word embeddings and modeling compositionality are discussed. Several influential papers that advanced word embeddings and recursive neural networks are also summarized.
Controlled Natural Language Generation from a Multilingual FrameNet-based Grammar (Normunds Grūzītis)
We present a currently bilingual but potentially multilingual FrameNet-based grammar library implemented in Grammatical Framework. The contribution of this paper is two-fold. First, it offers a methodological approach to automatically generate the grammar based on semantico-syntactic valence patterns extracted from FrameNet-annotated corpora. Second, it provides a proof of concept for two use cases illustrating how the acquired multilingual grammar can be exploited in different CNL applications in the domains of arts and tourism.
G2PIL: a grapheme-to-phoneme conversion tool for the Italian language (ijnlc)
This paper presents a knowledge-based approach for the grapheme-to-phoneme conversion (G2P) of isolated words of the Italian language. With more than 7,000 languages in the world, the biggest challenge today is to rapidly port speech processing systems to new languages with low human effort and at reasonable cost. This includes the creation of qualified pronunciation dictionaries. The dictionaries provide the mapping from the orthographic form of a word to its pronunciation, which is useful in both speech synthesis and automatic speech recognition (ASR) systems. For training the acoustic models we need an automatic routine that maps the spelling of the training set to a string of phonetic symbols representing the pronunciation.
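As a toy illustration of the knowledge-based idea, not the paper's actual rule set, a greedy longest-match G2P converter might look like this; the grapheme clusters and IPA symbols below are a small invented fragment, and real Italian G2P also needs context-sensitive rules (e.g. c before e/i).

```python
# Tiny, invented fragment of Italian G2P rewrite rules; a real system
# needs many more rules plus context conditions (e.g. c before e/i).
RULES = {
    "gli": "ʎ", "gn": "ɲ", "sc": "ʃ", "ch": "k", "gh": "g",
    "c": "k", "g": "g", "z": "ts",
    "a": "a", "e": "e", "i": "i", "o": "o", "u": "u",
    "t": "t", "n": "n", "l": "l", "m": "m", "r": "r", "s": "s",
}

def g2p(word):
    phones, i = [], 0
    while i < len(word):
        # Greedy longest match: try 3-, then 2-, then 1-character clusters.
        for n in (3, 2, 1):
            chunk = word[i:i + n]
            if chunk in RULES:
                phones.append(RULES[chunk])
                i += n
                break
        else:
            i += 1   # skip graphemes this toy rule set does not cover
    return "".join(phones)

print(g2p("gnocchi"))   # -> ɲokki with this rule fragment
```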
The document discusses the concept of language factories, which aim to support the definition and construction of programming languages in a component-based way. This would allow for greater reuse of common language components, more agile language engineering, and language refactoring and analysis. Some key goals of language factories include supporting reuse of language syntax, semantics, and tools, as well as enabling flexible composition of language components to build new languages. Examples of reusable expression and measurement language components are provided.
A Graph-based Cross-lingual Projection Approach for Weakly Supervised Relation Extraction (Seokhwan Kim)
This document describes a graph-based approach for cross-lingual projection of relation annotations from English to Korean. The approach constructs a graph with nodes for entity pairs and context words, connected by edges representing similarity. Label propagation is used to transfer annotations across the graph. Evaluation on four relations shows the graph-based approach improves over direct projection and other self-supervised methods, achieving a top F-measure of 76.3%. The approach helps alleviate errors from direct projection while leveraging contextual information.
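The label propagation step can be sketched in a few lines. The graph, seed labels, and the alpha parameter below are invented, and the update follows the standard normalised propagation scheme rather than the paper's exact formulation.

```python
import numpy as np

# Nodes 0-1 are English-labelled entity pairs (the seeds); nodes 2-4 are
# unlabelled Korean candidates. Graph weights and labels are invented.
W = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 0, 1, 1, 0],
], dtype=float)
Y0 = np.zeros((5, 2))
Y0[0] = [1, 0]            # seed: relation holds
Y0[1] = [0, 1]            # seed: relation does not hold

# Standard normalised label propagation: Y <- alpha * S @ Y + (1-alpha) * Y0.
D = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
S = D @ W @ D
alpha, Y = 0.8, Y0.copy()
for _ in range(50):       # iterate to (approximate) convergence
    Y = alpha * S @ Y + (1 - alpha) * Y0

print(Y.argmax(axis=1))   # propagated labels for all five nodes
```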
The study evaluated the effectiveness of a computer assisted pronunciation training (CAPT) system called PARLING for teaching English pronunciation to Italian children. 28 children participated and were split into a control group that received teacher-led training and an experimental group that used PARLING. Both groups showed significant improvement in pronunciation quality from pre-test to post-test, with no significant differences between the groups, indicating that PARLING was as effective as teacher-led instruction. Difficult and unknown words showed greater improvement than easy words known by the children.
The document presents a neural network architecture for various natural language processing (NLP) tasks such as part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. It shows results comparable to state-of-the-art using word embeddings learned from a large unlabeled corpus, and improved results from joint training of the tasks. The network transforms words into feature vectors, extracts higher-level features through neural layers, and is trained via backpropagation. Benchmark results demonstrate performance on par with traditional task-specific systems without heavy feature engineering.
This document summarizes Valeria de Paiva's talk on Portuguese linguistic tools. It discusses the goals in 2010 to develop natural language processing tools for Portuguese, including content analysis, text understanding, generation, summarization, dialogue systems and question answering. It outlines the challenges in developing these tools for Portuguese, particularly the lack of lexical resources like WordNet. Much of the work since 2010 has focused on developing OpenWordNet-PT as a key lexical resource. The document also discusses using this resource to build representations of text and enable basic inference through a framework called KIML.
This document discusses using machine learning techniques like neural networks to help decipher ancient scripts and languages. It describes how character-level sequence-to-sequence models can be used to identify cognates between related languages. Additional techniques like network flows and dynamic programming are used to model monotonic character alignments and jointly segment and match tokens between known and unknown languages. The approaches are able to identify cognates between languages like Ugaritic and Hebrew as well as segment and match the unknown Iberian language. Neural models that incorporate linguistic features like phonological embeddings are shown to improve decipherment performance.
Europeana meeting under Finland’s Presidency of the Council of the EU - Day 2 (Europeana)
Here are a few approaches to address the context demand challenge for machine translation of cultural heritage content:
- Leverage knowledge graphs and ontologies to disambiguate terms based on conceptual relationships
- Train domain-specific models on large cultural heritage corpora to capture nuances of language use in different contexts
- Perform multi-task learning to optimize models for both translation accuracy and conceptual mapping between languages
- Allow users to provide feedback to iteratively improve disambiguation of ambiguous terms over time
- Develop specialized interfaces that surface contextual clues from objects to help machine translation
The goal is to mimic how humans understand intended meaning based on surrounding context clues. Combining linguistic and conceptual techniques can help machines do the same.
Improving Document Clustering by Eliminating Unnatural Language (Jinho Choi)
Technical documents contain a fair amount of unnatural language, such as tables, formulas, and pseudo-code. Unnatural language can be an important source of confusion for existing NLP tools. This paper presents an effective method of distinguishing unnatural language from natural language, and evaluates the impact of unnatural language detection on NLP tasks such as document clustering. We view this problem as an information extraction task and build a multiclass classification model that assigns unnatural language components to four categories. First, we create a new annotated corpus by collecting slides and papers in various formats (PPT, PDF, and HTML), in which unnatural language components are annotated into four categories. We then explore features available from plain text to build a statistical model that can handle any format as long as it is converted into plain text. Our experiments show that removing unnatural language components gives an absolute improvement in document clustering of up to 15%. Our corpus and tool are publicly available.
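A minimal stand-in for such a model, with invented training lines and category names, using character n-gram features so that it works on any plain-text conversion:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training lines for four illustrative categories; character
# n-grams keep the model agnostic to the original document format.
lines = [
    ("The method improves clustering accuracy.", "natural"),
    ("We evaluate on three datasets.",           "natural"),
    ("x = argmax_y p(y|x)",                      "formula"),
    ("f(n) = f(n-1) + f(n-2)",                   "formula"),
    ("| system | P | R | F1 |",                  "table"),
    ("| ours | 81 | 78 | 79 |",                  "table"),
    ("for i in range(n): total += x[i]",         "code"),
    ("while queue: node = queue.pop()",          "code"),
]
texts, labels = zip(*lines)
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["| model | BLEU |", "Results are shown in Table 2."]))
```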
COMPUTATIONAL APPROACHES TO THE SYNTAX-PROSODY INTERFACE: USING PROSODY TO IMPROVE PARSING (Hussein Ghaly)
Main Goal:
Improve automatic syntactic parsing of spontaneous spoken sentences using prosodic cues
Theoretical Motivation:
Automatic parsing is negatively affected by syntactic ambiguity (Kummerfeld et al., 2012)
Prosody can help resolve some syntactic ambiguities (Cutler et al., 1997)
Syntactic structure is related to prosodic structure (Selkirk, 1986, among many other studies)
Learning with limited labelled data in NLP: multi-task learning and beyond (Isabelle Augenstein)
When labelled training data for certain NLP tasks or languages is not readily available, different approaches exist for leveraging other resources to train machine learning models. These resources are commonly either instances from a related task or unlabelled data.
An approach that has been found to work particularly well when only limited training data is available is multi-task learning.
There, a model learns from examples of multiple related tasks at the same time by sharing hidden layers between tasks, and can therefore benefit from a larger overall number of training instances and improve its generalisation performance. In the related paradigm of semi-supervised learning, unlabelled data as well as labelled data for related tasks can be easily utilised by transferring labels from labelled instances to unlabelled ones, essentially extending the training dataset.
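A minimal hard-parameter-sharing sketch of this setup (dimensions, task count, and the random training batches are illustrative, not from the papers below):

```python
import torch
import torch.nn as nn

# One shared encoder, one classifier head per task; all sizes and the
# random training batches below are illustrative.
class MultiTaskModel(nn.Module):
    def __init__(self, vocab=1000, dim=64, n_classes=(2, 5)):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)   # shared layers
        self.heads = nn.ModuleList(nn.Linear(dim, c) for c in n_classes)

    def forward(self, tokens, task):
        _, (h, _) = self.encoder(self.emb(tokens))
        return self.heads[task](h[-1])                       # task-specific head

model = MultiTaskModel()
opt = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()
batches = [(0, torch.randint(0, 1000, (8, 12)), torch.randint(0, 2, (8,))),
           (1, torch.randint(0, 1000, (8, 12)), torch.randint(0, 5, (8,)))]
# Alternate between tasks so the shared encoder sees examples of both.
for task, x, y in batches:
    loss = loss_fn(model(x, task), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```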
In this talk, I will present my recent and ongoing work in the space of learning with limited labelled data in NLP, including our NAACL 2018 papers 'Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate Label Spaces' [1] and 'From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings' [2].
[1] https://t.co/A5jHhFWrdw
[2] https://arxiv.org/abs/1802.09375
==========
Bio from my website http://isabelleaugenstein.github.io/index.html:
I have been a tenure-track assistant professor at the University of Copenhagen, Department of Computer Science, since July 2017, affiliated with the CoAStAL NLP group, and work in the general areas of Statistical Natural Language Processing and Machine Learning. My main research interests are weakly supervised and low-resource learning with applications including information extraction, machine reading and fact checking.
Before starting a faculty position, I was a postdoctoral research associate in Sebastian Riedel's UCL Machine Reading group, mainly investigating machine reading from scientific articles. Prior to that, I was a Research Associate in the Sheffield NLP group, a PhD Student in the University of Sheffield Computer Science department, a Research Assistant at AIFB, Karlsruhe Institute of Technology and a Computational Linguistics undergraduate student at the Department of Computational Linguistics, Heidelberg University.
Introduction to Natural Language Processing (gokulprasath06)
Do you want to learn NLP?
This slide will help you learn basic concepts in NLP.
Also Checkout: http://bit.ly/2Mub6xP
Any queries? Call us at +91 9884412301 / 9600112302.
1) The document presents MultiSeg, a method for learning bilingual word embeddings using subword information like character n-grams, morphological segments, and byte-pair encoding, especially for low-resource languages.
2) MultiSeg is evaluated on tasks like word translation, word similarity, and document classification and is shown to outperform existing methods, particularly for morphologically rich languages.
3) Qualitative analysis using t-SNE visualizations indicates MultiSeg learns higher quality cross-lingual embeddings that better represent morphological variants in both languages.
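The subword idea can be sketched fastText-style: a word vector is composed from its subword-unit vectors, so morphological variants that share subwords end up nearby. Everything below (dimensions, random vectors, n-gram size) is illustrative; MultiSeg itself also uses morphological segments and BPE units.

```python
import numpy as np

# A word vector is the average of vectors for its character n-grams, so
# morphological variants that share n-grams end up with similar vectors.
# All vectors are random toys standing in for trained embeddings.
rng = np.random.default_rng(0)
dim = 8
subword_vecs = {}

def ngrams(word, n=3):
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def word_vector(word):
    grams = ngrams(word)
    for g in grams:
        if g not in subword_vecs:            # lazily create toy vectors
            subword_vecs[g] = rng.normal(size=dim)
    return np.mean([subword_vecs[g] for g in grams], axis=0)

u, v = word_vector("unhappily"), word_vector("unhappy")
cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(round(float(cos), 2))   # shared n-grams yield clearly positive similarity
```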
Deep Learning Study Group @ Komachi Lab: "Learning Character-level Representations for Part-of-Speech Tagging" (Yuki Tomo)
Presented at the Deep Learning Study Group @ Komachi Lab on 12/22: "Learning Character-level Representations for Part-of-Speech Tagging" by Cícero Nogueira dos Santos and Bianca Zadrozny.
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology Complexity (gerogepatton)
This document describes a study that used machine learning models to measure the complexity of pronouncing different languages. The study trained a character-level transformer model on a grapheme-to-phoneme transliteration task for 22 languages. It found that languages with a more direct grapheme-to-phoneme mapping, like Esperanto and Malay, were easier for the model to learn than languages with a more complex mapping, like Cantonese and Japanese. The complexity of a language's pronunciation was found to correlate with how direct or simple its grapheme-to-phoneme mapping is. The study also noted that comparing languages fairly requires considering differences in available training data per language.
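The study measures difficulty with a trained transformer; as a back-of-the-envelope alternative, the directness of a grapheme-to-phoneme mapping can be approximated by the conditional entropy of phonemes given graphemes over aligned pairs. The alignments below are invented:

```python
import math
from collections import Counter

# H(phoneme | grapheme) over aligned pairs: a mapping close to one-to-one
# yields entropy near zero; a one-to-many mapping yields higher entropy.
def g2p_entropy(pairs):
    joint = Counter(pairs)
    marginal = Counter(g for g, _ in pairs)
    total = sum(joint.values())
    h = 0.0
    for (g, p), c in joint.items():
        h -= (c / total) * math.log2(c / marginal[g])
    return h

direct = [("a", "a"), ("b", "b"), ("a", "a"), ("o", "o")]                  # Esperanto-like
opaque = [("a", "a"), ("a", "ei"), ("a", "ae"), ("o", "o"), ("o", "u")]    # English-like
print(g2p_entropy(direct), g2p_entropy(opaque))   # 0.0 vs roughly 1.35 bits
```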
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology Complexity (IJITE)
Machine learning models allow us to compare languages by showing how hard a task in each language might be to learn and perform well on. Following this line of investigation, we explore what makes a language “hard to pronounce” by modelling the task of grapheme-to-phoneme (g2p) transliteration. By training a character-level transformer model on this task across 22 languages and measuring the model’s proficiency against its grapheme and phoneme inventories, we show that certain characteristics emerge that separate easier and harder languages with respect to learning to pronounce. Namely, the complexity of a language's pronunciation relative to its orthography is due to the expressiveness or simplicity of its grapheme-to-phoneme mapping. Further discussion illustrates how future studies should consider relative data sparsity per language to design fairer cross-lingual comparison tasks.
Learning to Pronounce as Measuring Cross Lingual Joint Orthography Phonology Complexity (ijrap)
Machine learning models allow us to compare languages by showing how hard a task in each language might be to learn and perform well on. Following this line of investigation, we explore what makes a language “hard to pronounce” by modelling the task of grapheme-to-phoneme (g2p) transliteration. By training a character-level transformer model on this task across 22 languages and measuring the model’s proficiency against its grapheme and phoneme inventories, we show that certain characteristics emerge that separate easier and harder languages with respect to learning to pronounce. Namely, the complexity of a language's pronunciation relative to its orthography is due to the expressiveness or simplicity of its grapheme-to-phoneme mapping. Further discussion illustrates how future studies should consider relative data sparsity per language to design fairer cross-lingual comparison tasks.
This document provides an introduction and background on natural language processing (NLP). It discusses the key categories of linguistic knowledge needed for NLP, including phonetics, morphology, syntax, semantics, pragmatics, and discourse. It also explains that NLP tasks involve resolving ambiguity at these different levels of language. Common models and algorithms used in NLP are described, such as state machines, formal rule systems, logic, and probabilistic models. Machine learning approaches are also discussed for automatically learning NLP representations.
Visual-Semantic Embeddings: some thoughts on Language (Roelof Pieters)
Language technology is rapidly evolving. A resurgence in the use of distributed semantic representations and word embeddings, combined with the rise of deep neural networks has led to new approaches and new state of the art results in many natural language processing tasks. One such exciting - and most recent - trend can be seen in multimodal approaches fusing techniques and models of natural language processing (NLP) with that of computer vision.
The talk is aimed at giving an overview of the NLP part of this trend. It will start with a short overview of the challenges in creating deep networks for language, as well as what makes for a “good” language model, and the specific requirements of semantic word spaces for multi-modal embeddings.
The document discusses ontology matching, which is the process of finding relationships between entities in different ontologies. It describes various techniques for ontology matching including basic techniques that operate at the element-level or structure-level, as well as classifications of matching techniques based on the type of input used and level of interpretation. The document also provides examples of commonly used methods for ontology matching like string-based, language-based, and structure-based techniques.
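A minimal element-level, string-based matcher, with toy ontologies and an arbitrary 0.7 threshold, shows the simplest of these techniques; it also shows where language-based methods have to take over:

```python
from difflib import SequenceMatcher

# Two toy ontologies; entity labels and the threshold are illustrative.
onto_a = ["Author", "Publication", "JournalArticle"]
onto_b = ["Writer", "Publications", "Journal_Article"]

def normalise(label):
    return label.replace("_", "").lower()

def similarity(a, b):
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

def match(a_entities, b_entities, threshold=0.7):
    pairs = []
    for a in a_entities:
        best = max(b_entities, key=lambda b: similarity(a, b))
        if similarity(a, best) >= threshold:
            pairs.append((a, best, round(similarity(a, best), 2)))
    return pairs

# Finds Publication/Publications and JournalArticle/Journal_Article;
# Author/Writer is a synonym pair that needs language-based techniques.
print(match(onto_a, onto_b))
```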
ELKL 4, Language Technology: learning from endangered languages (Dafydd Gibbon)
Presentation at the ELKL-4 (4th Endangered and Less Resourced Languages) conference, Agra University, India.
Types of language documentation (data and software tools).
NLP Town's Yves Peirsman talks about how word embeddings, LSTMs/RNNs, attention, encoder-decoder architectures, and more have helped NLP move forward, including which challenges remain to be tackled (and some techniques to do just that).
This document summarizes a paper on using simple lexical overlap features with support vector machines (SVMs) for Russian paraphrase identification. It introduces paraphrase identification and various paraphrase corpora. It then describes a knowledge-lean approach using only tokenization, lowercasing, and overlap features like union and intersection size as inputs to linear and RBF kernel SVMs. The method achieves competitive results on English, Turkish, and Russian paraphrase identification tasks.
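The knowledge-lean recipe is short enough to sketch end to end; the sentence pairs below are invented training data:

```python
from sklearn.svm import SVC

# Tokenise, lowercase, and compute simple overlap statistics per pair.
def features(s1, s2):
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return [len(a | b), len(a & b), len(a & b) / len(a | b)]

pairs = [  # invented training pairs: 1 = paraphrase, 0 = not
    ("the cat sat on the mat", "a cat was sitting on the mat", 1),
    ("he bought a new car", "he purchased a brand new car", 1),
    ("the cat sat on the mat", "stock prices fell sharply today", 0),
    ("she plays the violin", "the committee rejected the proposal", 0),
]
X = [features(s1, s2) for s1, s2, _ in pairs]
y = [label for _, _, label in pairs]

clf = SVC(kernel="linear").fit(X, y)   # the paper also tries an RBF kernel
print(clf.predict([features("a cat sat on a mat", "the cat was on the mat")]))
```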
The document summarizes key topics from ICASSP 2022, including general trends in speech and audio processing, self-supervised and contrastive learning approaches, security applications, and topics related to tasks like multilingualism and keyword spotting. Some of the main models and techniques discussed are Wav2vec, HuBERT, contrastive learning using Conformers, intermediate layer supervision in self-supervised learning, and anonymization of speech data for privacy.
Similar to What can typological knowledge bases and language representations tell us about linguistic properties?
Beyond Fact Checking — Modelling Information Change in Scientific Communication (Isabelle Augenstein)
The document discusses modelling information change in scientific communication. It begins by noting how science is often communicated through journalists to the public, and how the message can change and become exaggerated or misleading along the way. It then discusses developing models to detect exaggeration by predicting the strength of causal claims, such as distinguishing between correlational and causal language. Pattern exploiting training is explored as a way to leverage large language models for this task in a semi-supervised manner. Finally, it proposes generally modelling information change by comparing original research to how it is communicated elsewhere, such as in news articles and tweets, using semantic matching techniques. Experiments are discussed on newly created datasets to benchmark performance of models on this task.
The document discusses automatically detecting scientific misinformation and exaggeration. It introduces work on cite-worthiness detection to improve scientific document understanding, and on detecting exaggeration in health science press releases. It describes generating scientific claims from citations for zero-shot scientific fact checking. The talk covers claim detection and generation, cite-worthiness detection, scientific claim generation, and exaggeration detection.
The past decade has seen a substantial rise in the amount of mis- and disinformation online, from targeted disinformation campaigns to influence politics, to the unintentional spreading of misinformation about public health. This development has spurred research in the area of automatic fact checking, a knowledge-intensive and complex reasoning task. Most existing fact checking models predict a claim’s veracity with black-box models, which often lack explanations of the reasons behind their predictions and contain hidden vulnerabilities. The lack of transparency in fact checking systems and ML models in general has been exacerbated by increased model size and by “the right…to obtain an explanation of the decision reached” enshrined in European law. This talk presents some first solutions for generating explanations for fact checking models. It then examines how to assess the generated explanations using diagnostic properties, and how further optimising for these diagnostic properties can improve the quality of the generated explanations. Finally, the talk examines how to systematically reveal vulnerabilities of black-box fact checking models.
Most work on scholarly document processing assumes that the information processed is trustworthy and factually correct. However, this is not always the case. There are two core challenges, which should be addressed: 1) ensuring that scientific publications are credible -- e.g. that claims are not made without supporting evidence, and that all relevant supporting evidence is provided; and 2) that scientific findings are not misrepresented, distorted or outright misreported when communicated by journalists or the general public. I will present some first steps towards addressing these problems and outline remaining challenges.
Towards Explainable Fact Checking (DIKU Business Club presentation) (Isabelle Augenstein)
Outline:
- Fact checking – what is it and why do we need it?
- False information online
- Content-based automatic fact checking
- Explainability – what is it and why do we need it?
- Making the right predictions for the right reasons
- Model training pipeline
- Explainable fact checking – some first solutions
- Rationale selection
- Generating free-text explanations
- Wrap-up
Tutorial on 'Explainability for NLP' given at the first ALPS (Advanced Language Processing) winter school: http://lig-alps.imag.fr/index.php/schedule/
The talk introduces the concepts of 'model understanding' as well as 'decision understanding' and provides examples of approaches from the areas of fact checking and text classification.
Exercises to go with the tutorial are available here: https://github.com/copenlu/ALPS_2021
Automatic fact checking is one of the more involved NLP tasks currently researched: not only does it require sentence understanding, but also an understanding of how claims relate to evidence documents and world knowledge. Moreover, there is still no common understanding in the automatic fact checking community of how the subtasks of fact checking — claim check-worthiness detection, evidence retrieval, veracity prediction — should be framed. This is partly owing to the complexity of the task, despite efforts to formalise the task of fact checking through the development of benchmark datasets.
The first part of the talk will be on automatically generating textual explanations for fact checking, thereby exposing some of the reasoning processes these models follow. The second part of the talk will be on re-examining how claim check-worthiness is defined, and how check-worthy claims can be detected; followed by how to automatically generate claims which are hard to fact-check automatically.
Talk on 'Tracking False Information Online' at W-NUT workshop at EMNLP 2019.
=========
Digital media enables fast sharing of information and discussions among users. While this comes with many benefits to today’s society, such as broadening information access, the manner in which information is disseminated also has obvious downsides. Since fast access to information is expected by many users and news outlets are often under financial pressure, speedy access often comes at the expense of accuracy, which leads to misinformation. Moreover, digital media can be misused by campaigns to intentionally spread false information, i.e. disinformation, about events, individuals or governments. In this talk, I will present different ways false information is spread online, including misinformation and disinformation. I will then report findings from our recent and ongoing work on automatic fact checking, stance detection and framing attitudes.
Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate Label Spaces (Isabelle Augenstein)
Paper presented at NAACL 2018. Link: https://arxiv.org/abs/1802.09913
Abstract:
============
We combine multi-task learning and semi-supervised learning by inducing a joint embedding space between disparate label spaces and learning transfer functions between label embeddings, enabling us to jointly leverage unlabelled data and auxiliary, annotated datasets. We evaluate our approach on a variety of sequence classification tasks with disparate label spaces. We outperform strong single and multi-task baselines and achieve a new state-of-the-art for topic-based sentiment analysis.
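The core mechanism can be sketched as follows, with illustrative sizes and label sets: labels from different tasks share one embedding matrix, and a task's logits are similarities between the sentence encoding and that task's label embeddings.

```python
import torch
import torch.nn as nn

# Labels from both tasks share one embedding matrix; logits are
# similarities between a sentence encoding and the task's label
# embeddings. Sizes, label sets and inputs are illustrative.
labels = ["positive", "negative", "favor", "against", "neutral"]
label_emb = nn.Embedding(len(labels), 32)       # joint label space
encoder = nn.Linear(100, 32)                    # stand-in sentence encoder

def scores(sentence_vec, task_labels):
    ids = torch.tensor([labels.index(l) for l in task_labels])
    return encoder(sentence_vec) @ label_emb(ids).T

x = torch.randn(1, 100)                             # fake sentence representation
print(scores(x, ["positive", "negative"]))          # sentiment task
print(scores(x, ["favor", "against", "neutral"]))   # stance task, same space
```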
Spreading of mis- and disinformation is growing and is having a big impact on interpersonal communications, politics and even science.
Traditional methods, e.g. manual fact-checking by reporters, cannot keep up with the growth of information. On the other hand, there has been much progress in natural language processing recently, partly due to the resurgence of neural methods.
How can natural language processing methods fill this gap and help to automatically check facts?
This talk will explore different ways to frame fact checking and detail our ongoing work on learning to encode documents for automated fact checking, as well as describe future challenges.
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Scientific Publications (Isabelle Augenstein)
Shared task summary for SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Scientific Publications
Paper: https://arxiv.org/abs/1704.02853
Abstract:
We describe the SemEval task of extracting keyphrases and relations between them from scientific documents, which is crucial for understanding which publications describe which processes, tasks and materials. Although this was a new task, we had a total of 26 submissions across 3 evaluation scenarios. We expect the task and the findings reported in this paper to be relevant for researchers working on understanding scientific content, as well as the broader knowledge base population and information extraction communities.
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 (Isabelle Augenstein)
The document summarizes the history and goals of the WiNLP workshop, which aims to promote and support women and underrepresented groups in natural language processing. It discusses the growth of WiNLP from its inception in 2016 to the 2017 workshop with over 130 participants. It outlines WiNLP's mission to increase awareness of work by underrepresented groups and build community. It also notes challenges such as underrepresentation, bias, and lack of resources that WiNLP addresses through mentoring, funding, and community building.
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Summit) (Isabelle Augenstein)
The document discusses machine reading using neural machines. It presents goals of fact checking claims and understanding scientific publications. It outlines challenges in tasks like stance detection on tweets and summarizing scientific papers. These include interpreting statements based on the target or headline, handling unseen targets, and the small size of benchmark datasets which makes neural machine reading computationally costly.
Presentation of work that will be published at EMNLP 2016.
Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bošnjak, Sebastian Riedel. emoji2vec: Learning Emoji Representations from their Description. SocialNLP at EMNLP 2016. https://arxiv.org/abs/1609.08359
Georgios Spithourakis, Isabelle Augenstein, Sebastian Riedel. Numerically Grounded Language Models for Semantic Error Correction. EMNLP 2016. https://arxiv.org/abs/1608.04147
Isabelle Augenstein, Tim Rocktäschel, Andreas Vlachos, Kalina Bontcheva. Stance Detection with Bidirectional Conditional Encoding. EMNLP 2016. https://arxiv.org/abs/1606.05464
USFD at SemEval-2016 - Stance Detection on Twitter with Autoencoders (Isabelle Augenstein)
This paper describes the University of Sheffield's submission to the SemEval 2016 Twitter Stance Detection weakly supervised task (SemEval 2016 Task 6, Subtask B). In stance detection, the goal is to classify the stance of a tweet towards a target as "favor", "against", or "none". In Subtask B, the targets in the test data are different from the targets in the training data, thus rendering the task more challenging but also more realistic.
To address the lack of target-specific training data, we use a large set of unlabelled tweets containing all targets and train a bag-of-words autoencoder to learn how to produce feature representations of tweets. These feature representations are then used to train a logistic regression classifier on labelled tweets, with additional features such as an indicator of whether the target is contained in the tweet. Our submitted run on the test data achieved an F1 of 0.3270.
Paper: http://isabelleaugenstein.github.io/papers/SemEval2016-Stance.pdf
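A sketch of this pipeline, with a linear autoencoder standing in for the bag-of-words autoencoder and invented tweets and stance labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# A linear autoencoder standing in for the paper's bag-of-words
# autoencoder; vocabulary size, tweets and stance labels are invented.
rng = np.random.default_rng(0)
V, H, N = 50, 8, 200
X = (rng.random((N, V)) < 0.1).astype(float)     # unlabelled BOW "tweets"

W1 = rng.normal(0, 0.1, (V, H))                  # encoder weights
W2 = rng.normal(0, 0.1, (H, V))                  # decoder weights
lr = 0.1
for _ in range(300):                             # squared-error reconstruction
    Z = X @ W1                                   # encode
    err = Z @ W2 - X                             # reconstruction error
    gW2 = Z.T @ err / N
    gW1 = X.T @ (err @ W2.T) / N
    W1 -= lr * gW1
    W2 -= lr * gW2

# Encode a small labelled set and train the stance classifier on top.
X_lab = (rng.random((20, V)) < 0.1).astype(float)
y_lab = rng.integers(0, 3, 20)                   # favor / against / none
clf = LogisticRegression(max_iter=1000).fit(X_lab @ W1, y_lab)
print(clf.predict(X[:3] @ W1))
```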
Imitation learning is used to address the problem of distant supervision for relation extraction. It decomposes the task into named entity classification (NEC) and relation extraction (RE), allowing the models to be trained separately. Through an iterative process, imitation learning is able to learn the dependencies between NEC and RE even when only labels for RE are provided. This overcomes limitations of prior approaches that rely on distantly labeled data. Evaluation shows the approach improves over baselines by leveraging multi-stage modeling to compensate for mistakes at the NEC stage.
Extracting Relations between Non-Standard Entities using Distant Supervision and Imitation Learning (Isabelle Augenstein)
Poster for our EMNLP paper on extracting non-standard relations from the Web with distant supervision and imitation learning. Read the full paper here: https://aclweb.org/anthology/D/D15/D15-1086.pdf
Slides for my tutorial at the ESWC Summer School 2015, giving an introduction to information extraction with Linked Data and an introduction to one of the applications of information extraction, opinion mining.
Signatures of wave erosion in Titan’s coasts (Sérgio Sacani)
The shorelines of Titan’s hydrocarbon seas trace flooded erosional landforms such as river valleys; however, it is unclear whether coastal erosion has subsequently altered these shorelines. Spacecraft observations and theoretical models suggest that wind may cause waves to form on Titan’s seas, potentially driving coastal erosion, but the observational evidence of waves is indirect, and the processes affecting shoreline evolution on Titan remain unknown. No widely accepted framework exists for using shoreline morphology to quantitatively discern coastal erosion mechanisms, even on Earth, where the dominant mechanisms are known. We combine landscape evolution models with measurements of shoreline shape on Earth to characterize how different coastal erosion mechanisms affect shoreline morphology. Applying this framework to Titan, we find that the shorelines of Titan’s seas are most consistent with flooded landscapes that subsequently have been eroded by waves, rather than a uniform erosional process or no coastal erosion, particularly if wave growth saturates at fetch lengths of tens of kilometers.
BIRDS DIVERSITY OF SOOTEA BISWANATH ASSAM.ppt.pptx (goluk9330)
Ahota Beel, nestled in Sootea, Biswanath, Assam, is celebrated for its extraordinary diversity of bird species. This wetland sanctuary supports a myriad of avian residents and migrants alike. Visitors can admire the elegant flights of migratory species such as the Northern Pintail and Eurasian Wigeon, alongside resident birds including the Asian Openbill and Pheasant-tailed Jacana. With its tranquil scenery and varied habitats, Ahota Beel offers a perfect haven for birdwatchers to appreciate and study the vibrant birdlife that thrives in this natural refuge.
SDSS1335+0728: The awakening of a ∼10⁶ M⊙ black hole (Sérgio Sacani)
Context. The early-type galaxy SDSS J133519.91+072807.4 (hereafter SDSS1335+0728), which had exhibited no prior optical variations during the preceding two decades, began showing significant nuclear variability in the Zwicky Transient Facility (ZTF) alert stream from December 2019 (as ZTF19acnskyy). This variability behaviour, coupled with the host-galaxy properties, suggests that SDSS1335+0728 hosts a ∼10⁶ M⊙ black hole (BH) that is currently in the process of ‘turning on’. Aims. We present a multi-wavelength photometric analysis and spectroscopic follow-up performed with the aim of better understanding the origin of the nuclear variations detected in SDSS1335+0728. Methods. We used archival photometry (from WISE, 2MASS, SDSS, GALEX, eROSITA) and spectroscopic data (from SDSS and LAMOST) to study the state of SDSS1335+0728 prior to December 2019, and new observations from Swift, SOAR/Goodman, VLT/X-shooter, and Keck/LRIS taken after its turn-on to characterise its current state. We analysed the variability of SDSS1335+0728 in the X-ray/UV/optical/mid-infrared range, modelled its spectral energy distribution prior to and after December 2019, and studied the evolution of its UV/optical spectra. Results. From our multi-wavelength photometric analysis, we find that: (a) since 2021, the UV flux (from Swift/UVOT observations) is four times brighter than the flux reported by GALEX in 2004; (b) since June 2022, the mid-infrared flux has risen more than two times, and the W1−W2 WISE colour has become redder; and (c) since February 2024, the source has begun showing X-ray emission. From our spectroscopic follow-up, we see that (i) the narrow emission line ratios are now consistent with a more energetic ionising continuum; (ii) broad emission lines are not detected; and (iii) the [OIII] line increased its flux ∼3.6 years after the first ZTF alert, which implies a relatively compact narrow-line-emitting region. Conclusions. We conclude that the variations observed in SDSS1335+0728 could be either explained by a ∼10⁶ M⊙ AGN that is just turning on or by an exotic tidal disruption event (TDE). If the former is true, SDSS1335+0728 is one of the strongest cases of an AGN observed in the process of activating. If the latter were found to be the case, it would correspond to the longest and faintest TDE ever observed (or another class of still unknown nuclear transient). Future observations of SDSS1335+0728 are crucial to further understand its behaviour. Key words: galaxies: active – accretion, accretion discs – galaxies: individual: SDSS J133519.91+072807.4
Order: Trombidiformes (Acarina); Class: Arachnida
Mites normally feed on the undersurface of the leaves, but the symptoms are more easily seen on the upper surface.
Tetranychids produce blotching (Spots) on the leaf-surface.
Tarsonemids and Eriophyids produce distortion (twist), puckering (Folds) or stunting (Short) of leaves.
Eriophyids produce distinct galls or blisters (fluid-filled sac in the outer layer)
Mechanics:- Simple and Compound PendulumPravinHudge1
A compound pendulum is a physical system with a more complex structure than a simple pendulum, incorporating its mass distribution and dimensions into its oscillatory motion around a fixed axis. Understanding its dynamics involves principles of rotational mechanics and the interplay between gravitational potential energy and kinetic energy. Compound pendulums are used in various scientific and engineering applications, such as seismology for measuring earthquakes, in clocks to maintain accurate timekeeping, and in mechanical systems to study oscillatory motion dynamics.
Dr. Firoozeh Kashani-Sabet is an innovator in Middle Eastern Studies and approaches her work, particularly focused on Iran, with a depth and commitment that has resulted in multiple book publications. She is notable for her work with the University of Pennsylvania, where she serves as the Walter H. Annenberg Professor of History.
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at z = 2.9 wi...Sérgio Sacani
We present the JWST discovery of SN 2023adsy, a transient object located in a host galaxy JADES-GS+53.13485−27.82088 with a host spectroscopic redshift of 2.903 ± 0.007. The transient was identified in deep James Webb Space Telescope (JWST)/NIRCam imaging from the JWST Advanced Deep Extragalactic Survey (JADES) program. Photometric and spectroscopic followup with NIRCam and NIRSpec, respectively, confirm the redshift and yield UV-NIR light-curve, NIR color, and spectroscopic information all consistent with a Type Ia classification. Despite its classification as a likely SN Ia, SN 2023adsy is both fairly red (E(B−V) ∼ 0.9) despite a host galaxy with low extinction and has a high Ca II velocity (19,000 ± 2,000 km/s) compared to the general population of SNe Ia. While these characteristics are consistent with some Ca-rich SNe Ia, particularly SN 2016hnk, SN 2023adsy is intrinsically brighter than the low-z Ca-rich population. Although such an object is too red for any low-z cosmological sample, we apply a fiducial standardization approach to SN 2023adsy and find that the SN 2023adsy luminosity distance measurement is in excellent agreement (≲ 1σ) with ΛCDM. Therefore, unlike low-z Ca-rich SNe Ia, SN 2023adsy is standardizable and gives no indication that SN Ia standardized luminosities change significantly with redshift. A larger sample of distant SNe Ia is required to determine if SN Ia population characteristics at high-z truly diverge from their low-z counterparts, and to confirm that standardized luminosities nevertheless remain constant with redshift.
TOPIC OF DISCUSSION: CENTRIFUGATION SLIDESHARE.pptxshubhijain836
Centrifugation is a powerful technique used in laboratories to separate components of a heterogeneous mixture based on their density. This process utilizes centrifugal force to rapidly spin samples, causing denser particles to migrate outward more quickly than lighter ones. As a result, distinct layers form within the sample tube, allowing for easy isolation and purification of target substances.
What can typological knowledge bases and language representations tell us about linguistic properties?
1. Typ-NLP Workshop
1 August 2019
What can typological
knowledge bases and
language representations
tell us about linguistic
properties?
Isabelle Augenstein*
augenstein@di.ku.dk
@IAugenstein
http://isabelleaugenstein.github.io/
*Credit for many of the slides: Johannes Bjerva
2. Linguistic Typology
● ‘The systematic study and comparison of language
structures’ (Velupillai, 2012)
● Long history (Herder, 1772; von der Gabelentz, 1891; …)
● Computational approaches (Dunn et al., 2011; Wälchli,
2014; Östling, 2015, ...)
3. Why Computational Typology?
● Answer linguistic research questions on large scale
● About relationships between languages
● About relationships between structural features of languages
● Facilitate multilingual learning
○ Cross-lingual transfer
○ Few-shot or zero-shot learning
4. How to Obtain Typological Knowledge?
● Discrete representation of language features in typological
knowledge bases
● World Atlas of Language Structures (WALS)
● Continuous representation of language features via
language embeddings
● Learned via language modelling
5. Why Computational Typology?
● Answer linguistic research questions on large scale
● Multilingual learning
○ Language representations
○ Cross-lingual transfer
○ Few-shot or zero-shot learning
● This talk:
○ Features in the World Atlas of Language Structures (WALS)
○ Computational Typology via unsupervised modelling of languages
in neural networks
8. Can language representations be learned from data?
Resources that exist for many languages:
● Universal Dependencies (>60 languages)
● UniMorph (>50 languages)
● New Testament translations (>1,000 languages)
● Automated Similarity Judgment Program (>4,500
languages)
9. Multilingual NLP and Language Representations
● No explicit representation
○ Multilingual Word Embeddings
● Google’s “Enabling zero-shot
learning” NMT trick
○ Language given explicitly in
input
● One-hot encodings
○ Languages represented as a
sparse vector
● Language Embeddings
○ Languages represented as a
distributed vector
(Östling and Tiedemann, 2017)
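To make the contrast between the last two options concrete, here is a minimal sketch (PyTorch; the language ids and dimensions are hypothetical, not from the slides): a one-hot language vector is sparse and fixed, while a language embedding is dense and trained along with the model.

```python
import torch
import torch.nn as nn

NUM_LANGS, EMB_DIM = 4, 8

# One-hot encoding: sparse, fixed, encodes no notion of language similarity.
one_hot = torch.eye(NUM_LANGS)          # row i represents language i
# Language embedding: dense, trainable, updated together with the model,
# so related languages can end up with similar vectors.
lang_emb = nn.Embedding(NUM_LANGS, EMB_DIM)

fi, et = 0, 1                           # hypothetical ids for Finnish, Estonian
print(one_hot[fi] @ one_hot[et])        # always 0: no relatedness expressible
sim = torch.cosine_similarity(lang_emb(torch.tensor(fi)),
                              lang_emb(torch.tensor(et)), dim=0)
print(sim)                              # learned similarity, changes in training
```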
10. Experimental Setup
Data
● Pre-trained language embeddings (Östling and Tiedemann, 2017)
○ Trained via Language Modelling on New Testament data
● PoS annotation from Universal Dependencies for
○ Finnish
○ Estonian
○ North Sami
○ Hungarian
Task
● Fine-tune language embeddings on PoS tagging
● Investigate how typological properties are encoded in these for four
Uralic languages
13. Talk Overview
Part 1: Language Embeddings
- Do they aid multilingual parameter sharing?
- Do they encode typological properties?
- What types of similarities between languages do they encode?
Part 2: Typological Knowledge Bases
- Can they be populated automatically?
- Can they be used to discover typological implications?
15. Parameter sharing between dependency
parsers for related languages
Miryam de Lhoneux, Johannes Bjerva,
Isabelle Augenstein, Anders Søgaard
EMNLP 2018
16. Cross-lingual sharing with language embeddings
● Do language embeddings help to learn soft sharing
strategies?
● Use case: transition-based dependency parsing
● Types of parameters:
● Character embeddings
● Word embeddings
● Transition parameters (MLP)
● Ablation with language embedding concatenated with
char, word or transition vector
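A sketch of the ablation idea just described (PyTorch; names and dimensions are hypothetical, not the paper's code): the same helper either passes a parameter group's input through unchanged (fully shared) or concatenates a trainable language embedding onto it (soft, language-conditioned sharing).

```python
import torch
import torch.nn as nn

NUM_LANGS, LANG_DIM = 10, 32
lang_emb = nn.Embedding(NUM_LANGS, LANG_DIM)

def condition(vec: torch.Tensor, lang_id: int, shared: bool) -> torch.Tensor:
    """Optionally concatenate the language embedding onto one parameter
    group's input (char, word, or transition/MLP features)."""
    if shared:
        return vec  # hard sharing: identical treatment for every language
    l = lang_emb(torch.tensor([lang_id])).expand(vec.size(0), -1)
    return torch.cat([vec, l], dim=-1)  # soft sharing, learned per language

# e.g. share char representations, condition the transition (MLP) input:
char_in = condition(torch.randn(8, 100), lang_id=2, shared=True)
mlp_in = condition(torch.randn(8, 200), lang_id=2, shared=False)
```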
17. Cross-lingual sharing with language embeddings
Lang Tokens Family Word order
ar 208,932 Semitic VSO
he 161,685 Semitic SVO
et 60,393 Finnic SVO
fi 67,258 Finnic SVO
hr 109,965 Slavic SVO
ru 90,170 Slavic SVO
it 113,825 Romance SVO
es 154,844 Romance SVO
nl 75,796 Germanic No dom. order
no 76,622 Germanic SVO
Table 1: Dataset characteristics
[Truncated screenshot of a paper excerpt: “…classifier parameters always helps, whereas the usefulness of sharing LSTM parameters depends…”, followed by a clipped description of how the transition-based parser builds its input vectors from word embeddings and character-BiLSTM states.]
18. Cross-lingual sharing with language embeddings
[Bar chart: average parsing score (AVG, y-axis roughly 78–80) for the Mono, Lang-Best, Best, All and Soft conditions.]
• Mono: single-task baseline
• Lang-best: best sharing strategy for each language
• Best: best sharing strategy across languages (char not shared,
word shared, transition shared with language embedding)
• All: all parameters shared
• Soft: sharing learned using language embeddings
19. Related vs. Unrelated Languages
[Bar chart: average parsing score (AVG, y-axis roughly 78–80) for the Mono, Lang-Best, Best, All and Soft conditions, here comparing related vs. unrelated languages.]
• Mono: single-task baseline
• Lang-best: best sharing strategy for each language
• Best: best sharing strategy across languages (char not shared,
word shared, transition shared with language embedding)
• All: all parameters shared
• Soft: sharing learned using language embeddings
20. Tracking Typological Traits of Uralic
Languages in Distributed Language
Representations
Johannes Bjerva, Isabelle Augenstein
IWCLUL 2018
21. Language Embeddings in Deep Neural Networks
1. Do language
embeddings aid
multilingual modelling?
2. Do language
embeddings contain
typological
information?
22. Model performance (Monolingual PoS tagging)
• Compared to most
frequent class
baseline (black line)
• Model transfer
between Finnic
languages relatively
successful
• Little effect from
language
embeddings (to be
expected)
23. Model performance (Multilingual PoS tagging)
• Compared to
monolingual baseline
(black line)
• Model transfer
between Finnic
languages
outperforms
monolingual baseline
• Language
embeddings improve
multilingual modelling
24. Tracking Typological Traits (full language sample)
• Baseline: Most frequent
typological class in sample
• Language embeddings saved
at each training epoch
• Separate Logistic Regression
classifier trained for each
feature and epoch
• Input: Language
embedding
• Output: Typological class
• Typological features encoded
in language embeddings
change during training
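A minimal sketch of this probing setup (scikit-learn; the input arrays are hypothetical placeholders for the saved embeddings and WALS-style labels): one logistic regression classifier per typological feature and per training epoch.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# embeddings_per_epoch: dict mapping epoch -> (n_languages, dim) array
# features: dict mapping feature name -> (n_languages,) array of class labels
def probe(embeddings_per_epoch, features):
    """Accuracy of predicting each typological class from the language
    embeddings saved at each training epoch."""
    scores = {}
    for epoch, X in embeddings_per_epoch.items():
        for feat, y in features.items():
            clf = LogisticRegression(max_iter=1000)
            scores[(epoch, feat)] = cross_val_score(clf, X, y, cv=3).mean()
    return scores
```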
25. Tracking Typological Traits (Uralic languages held out)
• Some typological
features can be
predicted with high
accuracy for the
unseen Uralic
languages.
26. Cross-lingual sharing with language embeddings: Summary
● Conclusions
● Sharing high-level features more useful than low-level features
● If languages are unrelated, sharing low-level features hurts
performance
● Language embeddings help
● Only tested for (selected) language pairs
● Sharing for more languages
● Only tested for selected tasks (parsing, PoS tagging)
● Language embeddings pre-trained or trained end-to-end
● Soft or hard sharing based on typological KBs?
27. From Phonology to Syntax: Unsupervised
Linguistic Typology at Different Levels with
Language Embeddings
Johannes Bjerva, Isabelle Augenstein
NAACL HLT 2018
28. Language Embeddings in Deep Neural Networks
Do language
embeddings contain
typological information?
- Predict typological
features
- Study unsupervised vs.
fine-tuned embeddings
29. Research Questions
● RQ 1: Which typological properties are encoded in task-
specific distributed language representations, and can we
predict phonological, morphological and syntactic
properties of languages using such representations?
● RQ 2: To what extent do the encoded properties change as
the representations are fine-tuned for tasks at different
linguistic levels?
● RQ 3: How are language similarities encoded in fine-tuned
language embeddings?
30. Phonological Features
● 20 features
● E.g. descriptions of the
consonant and vowel
inventories, presence of tone
and stress markers
31. Morphological Features
● 41 features
● Features from morphological
and nominal chapter
● E.g. number of genders, usage
of definite and indefinite articles
and reduplication
36. Part-of-Speech Tagging (UD)
- Improvements for all experimental settings
-> Pre-trained and fine-tuned language embeddings encode
features relevant to word order
System/Features | Random lang/feat pairs from word order features | Random lang/feat pairs from all features
Most frequent class | 67.81% | 82.93%
k-NN (pre-trained) | 76.66% | 82.69%
k-NN (fine-tuned) | *80.81% | 83.55%
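The k-NN rows correspond to predicting a held-out language's feature value from its nearest neighbours in language-embedding space; a minimal sketch (scikit-learn; the data arrays are hypothetical):

```python
from sklearn.neighbors import KNeighborsClassifier

def knn_feature_prediction(train_embs, train_labels, test_embs, k=3):
    """Predict a WALS feature value for held-out languages from the k
    nearest languages in embedding space (cosine distance)."""
    clf = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    clf.fit(train_embs, train_labels)
    return clf.predict(test_embs)
```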
37. Conclusions
- Language embeddings can encode typological features
- Works for morphological inflection and PoS tagging
- Does not work for phonological tasks
- We can predict typological features for unseen language families
with high accuracies
- G2P task: phonological differences between otherwise similar
languages (e.g. Norwegian Bokmål and Danish) are accurately
encoded
38. What do Language Representations Really
Represent?
Johannes Bjerva, Robert Östling, Maria Han
Veiga, Jörg Tiedemann, Isabelle Augenstein
Computational Linguistics 2019
39. Language Representations encode Language Similarities
• Similar languages – similar representations
• ...similar how?
• Can reconstruct language family trees (Rabinovich et al. 2017)!
• So... Language family (genetic) similarity?
40. What Do Language Representations Really Represent?
[Figure 1: language representations for en, fr, es, pt, de and nl plotted in a two-dimensional space. What do their similarities represent: structural distance, family distance, or geographical distance? Surrounding paper text, clipped in the transcript, references the language vectors of Östling and Tiedemann (2017) and the multilingual models of Johnson et al. (2017).]
41. Language Representations from Monolingual Texts
• Input: official translations from EU languages to English (EuroParl)
• Train multilingual LM on various levels of abstraction
• Evaluate resulting language representations
[Figure 2: Czech and Swedish source speeches and their official English translations feed a multilingual language model trained at several levels of abstraction, e.g.:
CS "For example , in my country , the Czech Republic" (English translation)
CS ADP NOUN PUNCT ADP ADJ NOUN PUNCT DET PROPN PROPN (POS)
CS prep pobj punct prep poss pobj punct det compound nsubj (DepRel)
SE "In Stockholm , we must make comparisons and learn" (English translation)
SE ADP PROPN PUNCT PRON VERB VERB NOUN CCONJ VERB (POS)
SE prep pobj punct nsubj aux ROOT dobj cc conj (DepRel)
Caption: Problem illustration. Given official translations from EU languages to English, we train multilingual language models on various levels of abstraction, encoding the source languages. The resulting source language representations (Lraw, LPOS, LDepRel) are evaluated.]

Excerpt from Bjerva et al., "What do Language Representations Really Represent?":

Our work is most closely related to Rabinovich et al. (2017), who investigate representation learning on monolingual English sentences, which are translations from various source languages to English from the Europarl corpus (Koehn, 2005). They employ a feature-engineering approach to predict source languages and learn an Indo-European (IE) family tree using their language representations. Crucially, they posit that the relationships found between their representations encode the genetic relationships between languages. They use features based on sequences of POS tags, function words and cohesive markers. We significantly expand on this work by comparing three language similarity measures (§4). By doing this, we offer a stronger explanation of what language representations really represent.

3 Method. Figure 2 illustrates the data and problem we consider in this paper. We are given a set of English gold-standard translations from the official languages of the European Union, based on speeches from the European Parliament. We wish to learn language representations based on this data, and investigate the linguistic relationships which hold between the resulting representations (RQ2). For this to make sense, it is important to abstract away from the surface forms of the translations as, e.g., speakers from certain regions will tend to talk about the same issues. We therefore introduce several levels of abstraction: i) training on […] the input sequences themselves are, e.g., sequences of POS tags. Our model is similar to Östling and Tiedemann (2017), who train a character-based multilingual language model using a 2-layer LSTM, with the modification that each time-step includes a representation of the language at hand. That is to say, each input to their LSTM is represented both by a character representation, c, and a language representation, l ∈ L. Since the set of language representations L is updated during training, the resulting representations encode linguistic properties of the languages. Whereas Östling and Tiedemann (2017) model hundreds of languages, we model only English; however, we redefine L to be the set of source languages from which our translations originate.

4 Comparing Languages. We compare the resulting language embeddings to three different types of language distance measures: genetic distance estimated by methods from historical linguistics, geographical distance of speaker communities, and a novel measure for the structural distances between languages. As previously stated, our goal with this is to investigate whether it really is the genetic distances between languages which are captured by language representations, or if other distance measures provide more explanation (RQ2).

[…] having an incorrect view of the structure of the language representation space can be dangerous. For instance, the standard assumption of genetic similarity would imply that the representation of the Gagauz language (Turkic, spoken mainly in Moldova) should be interpolated from the genetically very close Turkish, but this would likely lead to poor performance in syntactic tasks since the two languages have […]
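A condensed sketch of the model family described in the excerpt (PyTorch; dimensions are hypothetical): a 2-layer character-level LSTM language model whose input at every time-step concatenates a character embedding with a language embedding that is updated during training.

```python
import torch
import torch.nn as nn

class LangCondCharLM(nn.Module):
    def __init__(self, n_chars, n_langs, char_dim=128, lang_dim=64, hidden=512):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.lang_emb = nn.Embedding(n_langs, lang_dim)  # the set L, trained
        self.lstm = nn.LSTM(char_dim + lang_dim, hidden,
                            num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_chars)

    def forward(self, chars, lang_id):
        # chars: (batch, seq) character ids; lang_id: (batch,) language ids
        c = self.char_emb(chars)
        l = self.lang_emb(lang_id).unsqueeze(1).expand(-1, chars.size(1), -1)
        h, _ = self.lstm(torch.cat([c, l], dim=-1))  # language at each step
        return self.out(h)  # next-character logits
```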
42. Tree Distance Evaluation
• Hierarchical clustering of language embeddings
• Compare resulting trees with gold phylogenetic trees
• Hierarchical clustering of cosine distances (Rabinovich et al. 2017)
Our generated trees yield comparable results to previous work:

Condition | Mean | St.d.
Raw text (LM-Raw) | 0.527 | -
Function words and POS (LM-Func) | 0.556 | -
Only POS (LM-POS) | 0.517 | -
Phrase-structure (LM-Phrase) | 0.361 | -
Dependency Relations (LM-Deprel) | 0.321 | -
POS trigrams (ROW17) | 0.353 | 0.06
Random (ROW17) | 0.724 | 0.07

Table 1: Tree distance evaluation (lower is better, cf. §5.1).

Given the lack of explicit syntactic information, it is unsurprising that the results on raw text (LM-Raw in Table 1) only marginally outperform the random baseline. To abstract away from the content and negate the geographical effect, we train on only function words and POS; this performs almost on par with raw text (LM-Func in Table 1), indicating that the level of abstraction reached is not sufficient to capture similarities between languages. We next investigate whether we can fully abstract away from the content by removing function words […]
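A sketch of the clustering step (SciPy; the comparison against gold phylogenetic trees is omitted): hierarchical clustering over pairwise cosine distances between language embeddings yields a tree that can then be scored against the gold tree.

```python
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import pdist

def embedding_tree(lang_embs):
    """Agglomerative clustering of language embeddings: pairwise cosine
    distances, then average-linkage hierarchical clustering."""
    Z = linkage(pdist(lang_embs, metric="cosine"), method="average")
    return to_tree(Z)  # root of the induced binary tree
```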
43. Distance Measures
• Family distance (following Rabinovich et al. 2017)
• Geographic distance
• Using Glottolog geocoordinates (Hammarström et al. 2017)
• Structural distance
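For the geographic measure, pairwise distances between speaker communities can be computed from Glottolog latitude/longitude coordinates, e.g. via the haversine formula (a sketch; the example coordinates are hypothetical):

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dlmb = np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))

# e.g. distance between two hypothetical speaker-community coordinates
print(haversine_km(60.2, 24.9, 59.4, 24.7))
```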
44. • Language embedding similarities most strongly correlate with
structural similarities
• Less strong correlation with genetic similarities, even though
phylogenetic trees can be faithfully reconstructed (Rabinovich et al.
2017)
[Figure 4: Correlations between similarities (Genetic, Geographical and Structural) and language representations (Raw, Func, POS, Phrase, Deprel). Significance at p < 0.001 is indicated by *.]
Analysis of Similarities
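The correlation analysis can be sketched as follows (SciPy; the matrix names are hypothetical): correlate the upper triangles of two symmetric language-distance matrices so that each language pair is counted once. A permutation-based Mantel test would be the more rigorous variant of the same idea.

```python
import numpy as np
from scipy.stats import spearmanr

def distance_correlation(emb_dists, other_dists):
    """Correlate two symmetric (n x n) language-distance matrices using
    only the upper triangle (each language pair counted once)."""
    iu = np.triu_indices_from(emb_dists, k=1)
    return spearmanr(emb_dists[iu], other_dists[iu])
```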
54. Contributions
• Greenberg’s universals are binary – However, correlations are
rarely 100%, so we implement a probabilisation of typology
• Framed as typological collaborative filtering,
exploiting correlations between languages and features
• We exploit raw linguistic data by adding a semi-supervised
extension
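A minimal sketch of the collaborative-filtering intuition (NumPy; hyperparameters hypothetical, and deliberately much simpler than the paper's probabilistic generative model): factorize the sparse language-by-feature matrix so that missing cells can be scored from learned language and feature factors.

```python
import numpy as np

def factorize(M, mask, k=16, lr=0.01, reg=0.1, epochs=200, seed=0):
    """M: (n_langs, n_feats) matrix of encoded feature values;
    mask: True where a value is observed. Learns L @ F.T to fill gaps."""
    rng = np.random.default_rng(seed)
    n_l, n_f = M.shape
    L = rng.normal(scale=0.1, size=(n_l, k))  # language factors
    F = rng.normal(scale=0.1, size=(n_f, k))  # feature factors
    for _ in range(epochs):
        E = mask * (M - L @ F.T)           # error on observed cells only
        L += lr * (E @ F - reg * L)        # gradient steps with L2 penalty
        F += lr * (E.T @ L - reg * F)
    return L @ F.T                         # dense reconstruction
```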
55. World Atlas of Language Structures (WALS)
Ø 2,500 languages
Ø 192 features
Feature 81A – Order of Subject, Object and Verb
56. WALS is Sparse and Skewed
Ø Sparse:
Most languages are
covered by only a
handful of features
Ø Skewed:
A few features have
much wider
coverage than
others
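Both properties are easy to verify on a WALS data dump (a sketch assuming a CLDF-style values.csv with Language_ID and Parameter_ID columns; the file and column names are assumptions to adjust to the actual release):

```python
import pandas as pd

# Assumes the CLDF distribution of WALS: one row per (language, feature) datapoint.
values = pd.read_csv("values.csv")

per_language = values.groupby("Language_ID")["Parameter_ID"].nunique()
per_feature = values.groupby("Parameter_ID")["Language_ID"].nunique()

print("median features per language:", per_language.median())   # sparse
print(per_feature.sort_values(ascending=False).head())           # skewed
```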
57. World Atlas of Language Structures (WALS)
Ø 2,500 languages → 800 languages
Ø 192 features → 160 features
Feature 81A – Order of Subject, Object and Verb
74. Semi-supervised Extension - Interpretability
Typological Feature
Prediction
Multilingual Language
Modelling
Compressing linguistic information
75. Evaluation
• Controlling for
Genetic Relationships
• Train on all out-of-family
data
• [0, 1, 5, 10, 20]% in-family
data
• Observed features in
matrix
• With/without pre-
trained language
embeddings
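A sketch of the split construction (plain Python; the language sample and family labels are hypothetical, and the percentages are applied at the language level for simplicity): train on all out-of-family languages plus a fixed percentage of in-family ones.

```python
import random

def make_split(languages, family_of, target_family, in_family_pct, seed=0):
    """All out-of-family languages, plus in_family_pct percent of the
    languages belonging to target_family."""
    rng = random.Random(seed)
    in_fam = [l for l in languages if family_of[l] == target_family]
    out_fam = [l for l in languages if family_of[l] != target_family]
    rng.shuffle(in_fam)
    return out_fam + in_fam[: int(len(in_fam) * in_family_pct / 100)]

langs = ["fi", "et", "hu", "sme", "de", "fr", "es"]      # hypothetical sample
family_of = {"fi": "Uralic", "et": "Uralic", "hu": "Uralic",
             "sme": "Uralic", "de": "IE", "fr": "IE", "es": "IE"}
splits = {p: make_split(langs, family_of, "Uralic", p) for p in (0, 1, 5, 10, 20)}
```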
78. Uncovering Probabilistic Implications in
Typological Knowledge Bases
Johannes Bjerva, Yova Kementchedjhieva,
Ryan Cotterell, Isabelle Augenstein
ACL 2019
79. Linguistic Typology and Greenberg’s Universals
VO languages have prepositions
OV languages have postpositions
80. From Correlations to Probabilistic Implications
Visualisation of a section of the induced graphical model.
Observing the features in the left-most nodes (SV, OV, and
Noun-Adjective), can we correctly infer the value of the
right-most node (SVO)?
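The question in the figure amounts to estimating a conditional probability from the knowledge base (a pandas sketch; the binary feature columns are hypothetical placeholders for WALS-derived features):

```python
import pandas as pd

def implication_strength(df, implicants, implicand):
    """P(implicand | implicants): among languages where every implicant
    feature holds, the fraction where the implicand also holds."""
    holds = df[implicants].all(axis=1)
    return df.loc[holds, implicand].mean() if holds.any() else float("nan")

# e.g. how often SV, OV and Noun-Adjective languages are also SOV:
# implication_strength(wals_df, ["SV", "OV", "Noun-Adjective"], "SOV")
```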
81. Accuracies for feature prediction in a typologically diverse
test set, across number of implicants used
N implicants | 2 | 3 | 4 | 5 | 6
Phonology | 0.75 | 0.82 | 0.84 | 0.86 | 0.89
Morphology | 0.77 | 0.85 | 0.87 | 0.70 | 0.82
Nominal Categories | 0.72 | 0.83 | 0.80 | 0.84 | 0.81
Nominal Syntax | 0.77 | 0.89 | 0.85 | 0.89 | 0.81
Verbal Categories | 0.80 | 0.84 | 0.80 | 0.86 | 0.90
Word Order | 0.74 | 0.86 | 0.86 | 0.86 | 0.93
Clause | 0.75 | 0.81 | 0.84 | 0.85 | 0.84
Complex | 0.82 | 0.83 | 0.87 | 0.93 | 0.84
Lexical | 0.83 | 0.76 | 0.75 | 0.85 | 0.79
Mean | 0.77 | 0.83 | 0.83 | 0.85 | 0.85

Baselines: Most frequent class 0.30 | Pairwise 0.77 | PRA 0.81 | Language embeddings 0.85

Table 1: Accuracies for feature prediction in a typologically diverse test set, across number of implicants used. Note that the numbers are not comparable across columns nor to the baseline, since each makes a different number of predictions.
Feature Prediction Accuracies
82. Hand-picked implications. In cases where the same is
covered by Daumé III and Campbell (2007), we borrow
their analysis (marked with *)
# | Implicant | Implicand
1* | Postpositions | Genitive-Noun (Greenberg #2a)
2* | Postpositions | OV (Greenberg #4)
3 | OV | SV
4* | Postpositions | SV
5* | Prepositions | VO (Greenberg #4)
6* | Prepositions | Initial subord. word (Lehmann)
7* | Adjective-Noun, Postpositions | Demonstrative-Noun
8* | Genitive-Noun, Adjective-Noun | OV
9 | SV, OV, Noun-Adjective | SOV
10 | Degree word-Adjective, VO and Noun–Relative Clause, SVO | Numeral-Noun
11 | SOV, OV and Relative Clause–Noun, Adjective-Degree word | Noun-Numeral

Table 2: Hand-picked implications. In cases where the same implication is covered by Daumé III and Campbell (2007), we borrow their analysis (marked with *).
Probabilistic Implications Found
84. Conclusions: This Talk
Part 1: Language Representations
- Improve performance for multilingual
sharing (de Lhoneux et al. 2018)
- Encode typological properties
- task-specific fine-tuned ones even
more so than ones only obtained using
language modelling (Bjerva &
Augenstein 2018a,b)
- Can be used to reconstruct phylogenetic
trees (Rabinovich et al. 2017, Östling &
Tiedemann 2017)
- … but actually mostly represent
structural similarities between
languages (Bjerva et al., 2019a)
85. Conclusions: This Talk
Part 2: Typological
Knowledge Bases
- Can be populated automatically
with high accuracy using KBP
methods (Bjerva et al. 2019b)
- Language embeddings further
improve performance
- Can be used to discover
probabilistic implications
- With one or multiple implicants
- Including Greenberg universals
87. Presented Papers
Miryam de Lhoneux, Johannes Bjerva, Isabelle Augenstein, Anders Søgaard.
Parameter sharing between dependency parsers for related languages. EMNLP
2018.
Johannes Bjerva, Isabelle Augenstein. Tracking Typological Traits of Uralic
Languages in Distributed Language Representations. Fourth International
Workshop on Computational Linguistics for Uralic Languages (IWCLUL 2018).
Johannes Bjerva, Isabelle Augenstein. From Phonology to Syntax: Unsupervised
Linguistic Typology at Different Levels with Language Embeddings. NAACL HLT 2018.
Johannes Bjerva, Robert Östling, Maria Han Veiga, Jörg Tiedemann, Isabelle
Augenstein. What do Language Representations Really Represent?
Computational Linguistics, Vol. 45, No. 2, June 2019.
Johannes Bjerva, Yova Kementchedjhieva, Ryan Cotterell, Isabelle Augenstein.
A Probabilistic Generative Model of Linguistic Typology. NAACL 2019.
Johannes Bjerva, Yova Kementchedjhieva, Ryan Cotterell, Isabelle Augenstein.
Uncovering Probabilistic Implications in Typological Knowledge Bases. ACL 2019.
88. Thanks to my collaborators and advisees!
Johannes Bjerva, Ryan Cotterell, Yova Kementchedjhieva, Miryam de
Lhoneux, Robert Östling, Maria Han Veiga, Jörg Tiedemann