The document discusses challenges in cross-language word alignment. It outlines topics including word alignment concepts and applications, state of the art, and limitations due to phenomena like multiword units. Guidelines are presented for annotating alignments between English, French, Portuguese and Spanish, including challenges like prepositional dependencies, multiword units, and contractions. The goal is to create linguistically informed gold standard alignment sets to help machine translation tasks.
Pptphrase tagset mapping for french and english treebanks and its application... - Lifeng (Aaron) Han
The document discusses a phrase tagset mapping between French and English treebanks and its application in machine translation evaluation. Key points:
- A universal phrase tagset with 9 categories was designed to map phrase tags from the French Treebank and English Penn Treebank.
- The tagset mapping aims to facilitate multilingual research by bridging differences in treebank tagsets.
- An unsupervised machine translation evaluation method was proposed that uses the universal tagset to compare phrase categories between source and translated sentences, without needing reference translations.
- Experiments on French-English translation tasks showed promising results, with the unsupervised method correlating reasonably well with BLEU and TER scores. However, there is still
Learning phoneme mappings for transliteration without parallel data - Attaporn Ninsuwan
The document presents a method for learning cross-language phoneme mappings without parallel data by framing transliteration as a decipherment problem and using monolingual resources to learn mappings between English and Japanese phonemes. It compares this unsupervised approach to a supervised approach using parallel data and finds the unsupervised method achieves 40% accuracy on a name transliteration task, similar to the supervised approach. The goal is to develop transliteration systems that do not require parallel resources for any language pair.
Requirements Engineering: focus on Natural Language Processing, Lecture 2 - alessio_ferrari
In this lecture, we give a practical guide on how to detect ambiguities in natural language requirements by means of GATE and by means of Python. A brief guide to Python is also included.
The previous lecture gives an introduction to the problem of ambiguity in requirements engineering. Find it here: https://www.slideshare.net/alessio_ferrari/requirements-engineering-focus-on-natural-language-processing-lecture-1
This paper reports our first attempt at integrating eSPERTo’s paraphrastic engine, which is based on the NooJ platform, with two application scenarios: a conversational agent and a summarization system. We briefly describe eSPERTo’s base resources and the modifications to these resources that enabled the production of the paraphrases required to feed both systems. Although the improvement observed in both scenarios is not significant, we present a detailed error analysis to further improve the achieved results in future experiments.
A general method applicable to the search for anglicisms in russian social ne... - Ilia Karpov
In the process of globalization, the number of English words in other languages has rapidly increased. In automatic speech recognition systems, spell-checking, tagging, and other natural language processing software, loan words are not easily recognized and should be evaluated separately. In this paper we present a corpus-based approach to the automatic detection of anglicisms in Russian social network texts. The proposed method is based on the idea that an original Latin-script word and its Cyrillic analogue are simultaneously similar in script, phonetics, and semantics. We used a set of transliteration, phonetic transcription, and morphological analysis methods to generate candidate hypotheses, and distributional semantic models to filter them. The resulting list of borrowings, gathered from approximately 20 million LiveJournal texts, shows good overlap with a manually collected dictionary. The proposed method is fully automated and can be applied to any domain-specific area.
Full paper available at:
https://www.academia.edu/29834070/A_General_Method_Applicable_to_the_Search_for_Anglicisms_in_Russian_Social_Network_Texts
(Final) cidoc 2009 chinese lang translation of the aat - AAT Taiwan
The document summarizes the methods and issues involved in developing a Chinese version of the Art & Architecture Thesaurus (AAT). It discusses the project's methodology, which includes equivalence mapping, translation, creating scope notes for new concepts, and expert review. Challenges include mapping culture-specific Chinese terms, translating terms with broader meanings, and issues of completeness and interpretation. The project aims to integrate Chinese cultural heritage concepts into a multilingual knowledge network.
This document describes a multilingual lexical simplification system for four Ibero-Romance languages: Spanish, Portuguese, Catalan, and Galician. The system uses a modular hybrid linguistic-statistical architecture that is the same across languages, with language-specific resources. It identifies complex words, performs word sense disambiguation, ranks synonyms, and inflects words using morphological generation. The system was evaluated on its ability to produce adequate and simple simplifications according to native speaker assessments.
This document provides an overview of the basics of Python programming. It discusses that Python is a simple, easy to learn, high-level, interpreted, object-oriented, extensible and portable programming language. It has an extensive standard library and is free and open source. The document also provides a brief history of Python, describing its creation and key releases over time. It outlines some basic Python concepts like variables, data types, operators, expressions and input/output.
This document presents machine learning techniques to identify verb subcategorization frames from Czech language corpora. It compares three statistical techniques and shows they can discover new subcategorization frames and label dependents as arguments or adjuncts with 88% accuracy. It describes the task, relevant properties of Czech including free word order and case marking, and how the techniques are applied without assuming a predefined frame set.
Realization of natural language interfaces using - unyil96
The document discusses research on using lazy functional programming (LFP) to build natural language interfaces (NLIs). LFP involves delaying evaluation of function arguments until needed. Over 45 researchers have investigated using LFP for NLI design and implementation due to similarities between some linguistic theories and LFP theories. The research has resulted in over 60 papers on using LFP for natural language processing tasks like syntactic and semantic analysis. The paper provides a comprehensive survey of this research area at the intersection of computer science and computational linguistics.
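The idea of delaying evaluation until a value is needed can be illustrated with a minimal sketch. Python is strict rather than lazy, so this example simulates laziness with an explicit thunk; the names `lazy`, `expensive_parse`, and `choose` are hypothetical and are not taken from the surveyed work.

```python
def lazy(fn):
    """Wrap a zero-argument function in a thunk that runs it at most once."""
    cache = []  # holds the computed value after the first call

    def force():
        if not cache:
            cache.append(fn())
        return cache[0]

    return force


calls = []  # records each time the expensive computation actually runs


def expensive_parse():
    """Stand-in for a costly analysis step (hypothetical)."""
    calls.append("run")
    return 42


def choose(flag, thunk):
    """Only forces the thunk when its value is actually required."""
    return thunk() if flag else 0
```

Passing `lazy(expensive_parse)` to `choose` with `flag=False` never runs the computation, and forcing the same thunk repeatedly runs it only once.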
Natural language processing (NLP) involves developing systems that can process and understand human language. This document discusses NLP tools and techniques for representing text numerically so it can be analyzed by machine learning algorithms. It covers topics like tokenization, part-of-speech tagging, named entity recognition, vector space models, term frequency-inverse document frequency (TF-IDF) weighting, and word embeddings which represent words as dense vectors of numbers. Popular Python libraries for NLP and text analysis are also introduced.
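As a rough illustration of TF-IDF weighting, the sketch below computes weights for already tokenized documents using only the standard library. It assumes one common unsmoothed variant (relative term frequency times the natural log of inverse document frequency); the summarized document may use a different formula, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length,
    idf = ln(number of documents / documents containing the term).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

Under this variant a term that occurs in every document gets weight zero, which is why very frequent words contribute little to the resulting vectors.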
Seeing is Correcting: Linked Open Data for Portuguese - Valeria de Paiva
OpenWordNet-PT is an open-source lexical database for Portuguese based on Princeton WordNet. It aims to create a Portuguese wordnet linked to Princeton's architecture while including senses specific to Portuguese. The resource is developed through manual and semi-automatic processes and contains over 50,000 synsets. It is distributed as RDF and accessible through a SPARQL endpoint and web interface that allows user suggestions and voting to improve data quality. Future work includes expanding coverage through corpora and improving relations between synsets.
- The document describes a PhD thesis defense about using rewriting logic to define the semantics of concurrent programming languages.
- The thesis proposes K as a framework for programming language definitions in rewriting logic, which aims to be more expressive, modular, and concurrent than existing approaches.
- It demonstrates K and its execution in Maude by defining the semantics of a simple concurrent language called KernelC.
This document discusses linguistic diversity in open-source software development. It presents models to calculate the probability that two random developers would not speak the same programming language. It then uses data from StackOverflow to determine a similarity measure between languages based on how often they are spoken together. This measure is used to calculate the risk of not finding replacement developers for a particular language. The document concludes that for Python, the risk depends on its similarity to other languages, which is low based on it being frequently spoken alongside other languages.
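The summary does not give the models themselves, so the following is only a sketch of the simplest possible case, under the assumption that every developer uses exactly one language, drawn independently according to each language's share. The function name and the example shares are hypothetical.

```python
def prob_no_common_language(shares):
    """Probability that two random developers use different languages.

    Assumes each developer uses exactly one language, chosen independently
    with the given shares (which must sum to 1).
    """
    total = sum(shares.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("language shares must sum to 1")
    # The two developers match with probability sum(p_i ** 2).
    return 1.0 - sum(p * p for p in shares.values())
```

With two equally popular languages the result is 0.5, and it approaches 1 as the population spreads over more languages. A model with multilingual developers, as the StackOverflow-based similarity measure suggests, would need co-occurrence data rather than single shares.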
1. The C language was invented by Dennis Ritchie in 1972 at Bell Labs by combining features from the B and BCPL languages. It allows both high-level and low-level programming.
2. C has advantages like being easy to write, having built-in operators and functions, supporting bit-wise operations and pointers, and having direct control over hardware. Disadvantages include being difficult to learn, having code that can be hard to follow, and not being well-suited for report formatting or heavy data file manipulation.
3. C is considered a middle-level language as it combines features of low-level assembly languages and high-level languages, allowing both system-level and application programming
Linguistic markup and transclusion processing in XML documents - Simon Dew
Transclusion can have linguistic consequences. This presentation proposes a markup scheme that can be used to indicate [a] the required form (e.g. syntactic case) of a transcluded term in an XML document, and [b] the syntactic features of the transcluded term that demand agreement in the surrounding document. It also describes a set of XSLT transformations that can be used to select the correct form of any dependent words in the surrounding document, using dictionaries that conform to the TEI.
Analyzing and sharing large amounts of data is a must for people working in many fields of scientific applications. Compiled languages are usually preferred in such environments, mainly because they came first, but also, importantly, because of their performance. Nonetheless, interpreted languages like Perl or Python are gaining an important share of scientific users, mainly because of their flexibility and greater productivity when writing code.
During this talk, the advantages of using Python will be stressed in scenarios where data should be analyzed interactively and shared between a number of users, even at remote locations. In particular, we will show PyTables, a Python library that is meant to access HDF5 data in a convenient but also efficient way. CSTables, a library that works on top of PyTables and allows remote access to HDF5 file repositories, will also be described.
Logics and Ontologies for Portuguese Understanding - Valeria de Paiva
The document summarizes the development of logics and ontologies for Portuguese understanding. It discusses the PARC Bridge system from 1999-2008 as inspiration for the goals in 2010 to develop similar NLP components for Portuguese. It describes the creation of OpenWordNet-PT and NomLex-PT as key lexical resources, and their uses in applications like FreeLing and sentiment analysis. Issues with OpenWordNet-PT are noted along with ongoing efforts to expand and link it to other resources like NomLex-PT.
Controlled Natural Language Generation from a Multilingual FrameNet-based Gra... - Normunds Grūzītis
We present a currently bilingual but potentially multilingual FrameNet-based grammar library implemented in Grammatical Framework. The contribution of this paper is two-fold. First, it offers a methodological approach to automatically generate the grammar based on semantico-syntactic valence patterns extracted from FrameNet- annotated corpora. Second, it provides a proof of concept for two use cases illustrating how the acquired multilingual grammar can be exploited in different CNL applications in the domains of arts and tourism.
Elena Bruches - The Hybrid Approach to Part-of-Speech Disambiguation - AIST
The document describes a hybrid approach to part-of-speech disambiguation that combines neural networks and manually crafted rules. The algorithm uses neural networks to generate a set of possible part-of-speech tags for each word, and rule-based tagging to generate another set. The final set of tags is the intersection of these two sets, or their union if the intersection is empty. The approach achieved 96.11% precision on one corpus and 86.39% precision on another larger corpus.
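The combination step described above (intersection of the two tag sets, falling back to their union when the intersection is empty) can be sketched as follows. The function name `combine_tags` and the tag labels are illustrative assumptions; how the neural and rule-based candidate sets are produced is outside this sketch.

```python
def combine_tags(neural_tags, rule_tags):
    """Merge two candidate tag sets for one word.

    Returns the intersection of the sets, or their union when the
    two taggers share no candidate at all.
    """
    common = neural_tags & rule_tags
    return common if common else neural_tags | rule_tags


# Both taggers agree on NOUN, so only NOUN survives.
agreed = combine_tags({"NOUN", "VERB"}, {"NOUN", "ADJ"})

# No overlap, so every candidate from either tagger is kept.
fallback = combine_tags({"NOUN"}, {"VERB"})
```

Falling back to the union keeps the word ambiguous rather than discarding a possibly correct tag when the two components disagree completely.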
This paper presents a methodology to extract a paraphrase database for the European and Brazilian varieties of Portuguese, and discusses a set of paraphrastic categories of multiwords and phrasal units, such as the compounds toda a gente vs todo o mundo "everybody" or the gerundive constructions [estar a + V-Inf] vs [ficar + V-Ger] (e.g., estive a observar vs fiquei observando "I was observing"), which are extremely relevant to high-quality paraphrasing. The variants were manually aligned in the e-PACT corpus, using the CLUE-Aligner tool. The methodology, inspired by the Logos Model, focuses on a semantico-syntactic analysis of each paraphrastic unit and constitutes a subset of the Gold-CLUE-Paraphrases. The construction of a larger dataset of paraphrastic contrasts among the distinct varieties of the Portuguese language is indispensable for variety adaptation, i.e., for dealing with the cultural, linguistic and stylistic differences between them, making it possible to convert texts (semi-)automatically from one variety into another, a key function in paraphrasing systems. This topic represents an interesting new line of research with valuable applications in language learning, language generation, question-answering, summarization, and machine translation, among others. The paraphrastic units are the first resource of their kind for Portuguese to become available to the scientific community for research purposes.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Parallel Corpora in (Machine) Translation: goals, issues and methodologies - Antonio Toral
Parallel corpora play a central role in current approaches to machine and computer-assisted translation and also in any corpus-based study that involves original text and its translation. This talk motivates the use of parallel data, as well as its desired properties. It then introduces practical methodologies to automatically acquire and prepare parallel data for the task at hand. Finally, it glances at the neighbouring field of Translation Studies to assert that translations can differ to a great extent depending on the strategy followed by the translator, which might lead to the translation being more or less appropriate for its use in corpus-based studies.
The document describes OpenLogos, an open-source machine translation system that uses knowledge-rich bilingual dictionaries. These dictionaries contain extensive semantic and syntactic information for entries using the Semantico-Syntactic Abstraction Language (SAL). Three English-to-other-language dictionaries were created for research purposes, containing over 80,000 entries each. The goal is to make the lexical resources freely available to help develop new NLP tools, especially for under-resourced languages.
An Introduction to Pre-training General Language Representations - zperjaccico
This document provides an overview of pre-training general language representations. It discusses early methods like ELMo and GPT that used bidirectional and autoregressive language models. It then focuses on BERT, explaining its bidirectional transformer architecture and pre-training objectives of masked language modeling and next sentence prediction. The document outlines extensions to BERT like ALBERT which aims to reduce parameters. It also discusses Chinese models like ERNIE and MT-BERT which were adapted from BERT for the Chinese language.
This document is a Spanish translation of a chapter from a novel. It tells the story of Holder, who confronts Grayson, the boyfriend of his sister Les, after finding him kissing another girl at a party. Holder forces Grayson to call Les and end their relationship. Although this hurts his sister, Holder believes she deserves better than Grayson.
This document covers several macroeconomic topics in Chile, such as inflation, deflation, the CPI, GDP, unemployment and equity. It explains that inflation is measured through the CPI and how this index has varied in recent months. It also defines deflation, describes how GDP is measured and how Chile's GDP has evolved, with the country maintaining the highest GDP per capita in Latin America. In addition, it explains the concept of unemployment, its causes and how the rate
The document explains cloud computing. The cloud makes it possible to store information such as music, videos and files on servers over the Internet instead of on a local computer. It works by accessing information stored on servers remotely through an Internet connection. It offers advantages such as access to information from any place and device, unlimited storage, and protection of the information against failures of the local device. However, it also raises concerns about the security of personal information.
This document provides an overview of the basics of Python programming. It discusses that Python is a simple, easy to learn, high-level, interpreted, object-oriented, extensible and portable programming language. It has an extensive standard library and is free and open source. The document also provides a brief history of Python, describing its creation and key releases over time. It outlines some basic Python concepts like variables, data types, operators, expressions and input/output.
This document presents machine learning techniques to identify verb subcategorization frames from Czech language corpora. It compares three statistical techniques and shows they can discover new subcategorization frames and label dependents as arguments or adjuncts with 88% accuracy. It describes the task, relevant properties of Czech including free word order and case marking, and how the techniques are applied without assuming a predefined frame set.
Realization of natural language interfaces usingunyil96
The document discusses research on using lazy functional programming (LFP) to build natural language interfaces (NLIs). LFP involves delaying evaluation of function arguments until needed. Over 45 researchers have investigated using LFP for NLI design and implementation due to similarities between some linguistic theories and LFP theories. The research has resulted in over 60 papers on using LFP for natural language processing tasks like syntactic and semantic analysis. The paper provides a comprehensive survey of this research area at the intersection of computer science and computational linguistics.
Natural language processing (NLP) involves developing systems that can process and understand human language. This document discusses NLP tools and techniques for representing text numerically so it can be analyzed by machine learning algorithms. It covers topics like tokenization, part-of-speech tagging, named entity recognition, vector space models, term frequency-inverse document frequency (TF-IDF) weighting, and word embeddings which represent words as dense vectors of numbers. Popular Python libraries for NLP and text analysis are also introduced.
Seeing is Correcting:Linked Open Data for PortugueseValeria de Paiva
OpenWordNet-PT is an open-source lexical database for Portuguese based on Princeton WordNet. It aims to create a Portuguese wordnet linked to Princeton's architecture while including senses specific to Portuguese. The resource is developed through manual and semi-automatic processes and contains over 50,000 synsets. It is distributed as RDF and accessible through a SPARQL endpoint and web interface that allows user suggestions and voting to improve data quality. Future work includes expanding coverage through corpora and improving relations between synsets.
- The document describes a PhD thesis defense about using rewriting logic to define the semantics of concurrent programming languages.
- The thesis proposes K as a framework for programming language definitions in rewriting logic, which aims to be more expressive, modular, and concurrent than existing approaches.
- It demonstrates K and its execution in Maude by defining the semantics of a simple concurrent language called KernelC.
This document discusses linguistic diversity in open-source software development. It presents models to calculate the probability that two random developers would not speak the same programming language. It then uses data from StackOverflow to determine a similarity measure between languages based on how often they are spoken together. This measure is used to calculate the risk of not finding replacement developers for a particular language. The document concludes that for Python, the risk depends on its similarity to other languages, which is low based on it being frequently spoken alongside other languages.
1. The C language was invented by Dennis Ritchie in 1972 at Bell Labs by combining features from the B and BCPL languages. It allows both high-level and low-level programming.
2. C has advantages like being easy to write, having built-in operators and functions, supporting bit-wise operations and pointers, and having direct control over hardware. Disadvantages include being difficult to learn, having code that can be hard to follow, and not being well-suited for report formatting or heavy data file manipulation.
3. C is considered a middle-level language as it combines features of low-level assembly languages and high-level languages, allowing both system-level and application programming
Linguistic markup and transclusion processing in XML documentsSimon Dew
Transclusion can have linguistic consequences. This presentation proposes a markup scheme that can be used to indicate [a] the required form (e.g. syntactic case) of a transcluded term in an XML document, and [b] the syntactic features of the transcluded term that demand agreement in the surrounding document. It also describes a set of XSLT transformations that can be used to select the correct form of any dependent words in the surrounding document, using dictionaries that conform to the TEI.
Analyzing and sharing large amounts of data is a must for people working in many fields of scientific applications. It is usual that compiled languages are preferred in such environments, mainly because they came first, but also importantly, because of their performance. Nonetheless, interpreted languages like Perl or Python are gaining an important share of scientific users mainly because of their flexibility and greater productivity when writing code.
During this talk, the advantages of using Python will be stressed in scenarios where data should be analyzed interactively and shared between a number of users, even on remote locations. In particular, we will show PyTables, a Python library that is meant to access HDF5 data on a convenient, but also efficient, way. CSTables, a library that works on top of PyTables allowing remote access to HDF5 file repositories, will also be described.
Logics and Ontologies for Portuguese UnderstandingValeria de Paiva
The document summarizes the development of logics and ontologies for Portuguese understanding. It discusses the PARC Bridge system from 1999-2008 as inspiration for the goals in 2010 to develop similar NLP components for Portuguese. It describes the creation of OpenWordNet-PT and NomLex-PT as key lexical resources, and their uses in applications like FreeLing and sentiment analysis. Issues with OpenWordNet-PT are noted along with ongoing efforts to expand and link it to other resources like NomLex-PT.
Controlled Natural Language Generation from a Multilingual FrameNet-based Gra...Normunds Grūzītis
We present a currently bilingual but potentially multilingual FrameNet-based grammar library implemented in Grammatical Framework. The contribution of this paper is two-fold. First, it offers a methodological approach to automatically generate the grammar based on semantico-syntactic valence patterns extracted from FrameNet- annotated corpora. Second, it provides a proof of concept for two use cases illustrating how the acquired multilingual grammar can be exploited in different CNL applications in the domains of arts and tourism.
Elena Bruches - The Hybrid Approach to Part-of-Speech DisambiguationAIST
The document describes a hybrid approach to part-of-speech disambiguation that combines neural networks and manually crafted rules. The algorithm uses neural networks to generate a set of possible part-of-speech tags for each word, and rule-based tagging to generate another set. The final set of tags is the intersection of these two sets, or their union if the intersection is empty. The approach achieved 96.11% precision on one corpus and 86.39% precision on another larger corpus.
This paper presents a methodology to extract a paraphrase database for the European and Brazilian varieties of Portuguese, and discusses a set of paraphrastic categories of multiwords and phrasal units, such as the compounds toda a gente vs todo o mundo "everybody" or the gerundive constructions [estar a + V-Inf] vs [ficar + V-Ger] (e.g., estive a observar vs fiquei observando "I was observing"), which are extremely relevant to high quality paraphrasing. The variants were manually aligned in the e-PACT corpus, using the CLUE-Aligner tool. The methodology, inspired by the Logos Model, focuses on a semantico-syntactic analysis of each paraphrastic unit, and the resulting set constitutes a subset of the Gold-CLUE-Paraphrases. The construction of a larger dataset of paraphrastic contrasts among the distinct varieties of the Portuguese language is indispensable for variety adaptation, i.e., for dealing with the cultural, linguistic and stylistic differences between them, making it possible to convert texts (semi-)automatically from one variety into another, a key function in paraphrasing systems. This topic represents an interesting new line of research with valuable applications in language learning, language generation, question-answering, summarization, and machine translation, among others. The paraphrastic units are the first resource of their kind for Portuguese to become available to the scientific community for research purposes.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Parallel Corpora in (Machine) Translation: goals, issues and methodologiesAntonio Toral
Parallel corpora play a central role in current approaches to machine and computer-assisted translation and also in any corpus-based study that involves original text and its translation. This talk motivates the use of parallel data, as well as its desired properties. It then introduces practical methodologies to automatically acquire and prepare parallel data for the task at hand. Finally, it glances at the neighbouring field of Translation Studies to assert that translations can differ to a great extent depending on the strategy followed by the translator, which might lead to the translation being more or less appropriate for its use in corpus-based studies.
The document describes OpenLogos, an open-source machine translation system that uses knowledge-rich bilingual dictionaries. These dictionaries contain extensive semantic and syntactic information for entries using the Semantico-Syntactic Abstraction Language (SAL). Three English-to-other language dictionaries were created for research purposes containing over 80,000 entries each. The goal is to make the lexical resources freely available to help develop new NLP tools, especially for under-resourced languages.
An Introduction to Pre-training General Language Representationszperjaccico
This document provides an overview of pre-training general language representations. It discusses early methods like ELMo and GPT that used bidirectional and autoregressive language models. It then focuses on BERT, explaining its bidirectional transformer architecture and pre-training objectives of masked language modeling and next sentence prediction. The document outlines extensions to BERT like ALBERT which aims to reduce parameters. It also discusses Chinese models like ERNIE and MT-BERT which were adapted from BERT for the Chinese language.
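To make the masked language modeling objective concrete, the following is a simplified sketch of how training inputs and labels could be built. It is an illustration under stated assumptions: the helper name `mask_tokens` is hypothetical, the 15% default is the rate commonly cited for BERT, and BERT's additional handling of selected tokens (sometimes substituting a random token or leaving the token unchanged) is deliberately left out.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Replace each token with [MASK] with probability mask_prob.

    Returns (masked_tokens, labels). labels[i] holds the original token
    at masked positions and None elsewhere, so a model would only be
    scored on the positions it had to reconstruct.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, rng=random.Random(0))
```

Because the model sees context on both sides of each masked position, this objective is what allows a bidirectional encoder to be trained, in contrast to a left-to-right autoregressive objective.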
This document is a Spanish translation of a chapter of a novel. It tells the story of Holder, who confronts Grayson, the boyfriend of his sister Les, after finding him kissing another girl at a party. Holder forces Grayson to call Les and end their relationship. Although this hurts his sister, Holder believes she deserves better than Grayson.
This document covers several macroeconomic topics in Chile, such as inflation, deflation, the CPI, GDP, unemployment and equity. It explains that inflation is measured through the CPI and how this index has varied in recent months. It also defines deflation, describes how GDP is measured and how Chile's GDP has evolved, maintaining the highest GDP per capita in Latin America. In addition, it explains the concept of unemployment, its causes and how the rate
The document explains cloud computing. The cloud makes it possible to store information such as music, videos and files on servers over the Internet rather than on a local computer. It works by remotely accessing information stored on servers through an Internet connection. It offers advantages such as access to information from any place and device, unlimited storage, and protection of the information against failures of the local device. However, it also raises concerns about the security of personal information.
This document provides the agenda and background information for the "Transparent Ulaanbaatar Anti-Corruption Forum" taking place on October 6-7, 2014 in Ulaanbaatar, Mongolia. The forum brings together government officials, civil society, and international experts to discuss initiatives and strategies to reduce corruption. Over the two days, there will be panel discussions on topics like government and civil society engagement, land allocation, and procurement procedures. Participants will also work in breakout groups to develop an action plan as part of the mayor's efforts to make Ulaanbaatar a model for anti-corruption efforts. Background is provided on Ulaanbaatar, the organizing partners, and welcome messages from the
Wet Seal is a fashion retail chain with more than 500 stores in the US and Puerto Rico, offering clothing at affordable prices and frequent promotions. Its mission is to use fashion to help customers express their individuality and feel good while standing out. The document provides details about the chain, its mission, and links to videos and social media with more content.
This document presents a summary of the prologue of a novel. It tells the story of a boy named Daniel who hides in a maintenance closet during fifth period. A girl comes in crying and they end up kissing passionately. She runs away and does not return for a week, but then comes back to the closet saying that today she is not sad.
Treinamento de Professores - CFAPED - Ass.Deus Ministério de Cordovil- Ana Paula Baptista
The document discusses the concepts and functions of assessment in an educational context, specifically in the Sunday School manual. Assessment can be diagnostic, formative or summative, and each has distinct goals of providing information about learning and improving teaching.
The document presents several neuro-linguistic programming exercises that involve reading the colour of words written in different colours and identifying geometric figures regardless of the names written on them.
CLUE-Aligner is an interactive tool for annotating pairs of paraphrastic and translated linguistic units. It allows the alignment of contiguous and discontiguous multiword units through a matrix visualization. Alignments are classified as "sure" or "possible" based on criteria of optimal or approximate semantic equivalence. The tool was inspired by previous alignment applications but addresses current shortcomings like a lack of support for discontiguous multiwords. Future work includes using aligned units to train machine learning models for paraphrasing and machine translation applications.
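One way to picture an alignment that supports discontiguous units and the sure/possible distinction is as a pair of token-index tuples plus a label. The sketch below is an assumption for illustration, not CLUE-Aligner's actual data model; the class name `AlignedUnit`, the helper `is_discontiguous` and the example sentence pair are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AlignedUnit:
    source: Tuple[int, ...]  # token indices on the source side
    target: Tuple[int, ...]  # token indices on the target side
    kind: str                # "sure" or "possible"

    def __post_init__(self):
        if self.kind not in ("sure", "possible"):
            raise ValueError("kind must be 'sure' or 'possible'")


def is_discontiguous(indices):
    """True when the token indices leave a gap, i.e. other words intervene."""
    ordered = sorted(indices)
    return any(b - a > 1 for a, b in zip(ordered, ordered[1:]))


src = ["He", "turned", "the", "light", "off"]
tgt = ["Ele", "apagou", "a", "luz"]

# "turned ... off" is split by "the light" but aligns to the single verb "apagou".
unit = AlignedUnit(source=(1, 4), target=(1,), kind="sure")
source_words = [src[i] for i in unit.source]
```

Storing indices rather than substrings is what lets a single unit skip over inserted words, which span-based representations cannot express.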
1) The document discusses a linguistic evaluation of support verb constructions performed on the OpenLogos and Google Translate machine translation systems.
2) A corpus of 100 sentences containing support verb constructions was translated into several languages by each system and evaluated both quantitatively and qualitatively.
3) The evaluation found that OpenLogos translated more support verb constructions correctly thanks to its use of linguistic rules and representations, while Google Translate struggled more with non-contiguous and idiomatic constructions due to its statistical nature.
Poster presented at the 2nd meeting of the COST Action CA16105 - enetCollect : European Network for Combining Language Learning with Crowdsourcing Techniques, which took place at Alexandru Ioan Cuza University, in Iasi, Romania.
This poster shows paraphrastic suggestions in the eSPERTo paraphrasing system applied to a QA application on a virtual agent and to a summarization tool. It also shows how paraphrases can be used in language learning and the tests envisaged to make eSPERTo a Portuguese learning tool.
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools Lifeng (Aaron) Han
Abstract of Aaron Han’s Presentation
The main topic of this presentation is the evaluation of machine translation (MT). With the rapid development of MT, evaluation has become increasingly important for telling whether systems are actually making progress. Traditional human judgments are very time-consuming and expensive. On the other hand, existing automatic MT evaluation metrics have some weaknesses:
– they perform well on certain language pairs but poorly on others, which we call the language-bias problem;
– they consider either no linguistic information (so the metrics correlate poorly with human judgments) or too many linguistic features (which makes them hard to replicate), which we call the extremism problem;
– they are designed around an incomplete set of factors (e.g. precision only).
To address these problems, he has developed several automatic evaluation metrics that:
– use tunable parameters to address the language-bias problem;
– use concise linguistic features to address the linguistic extremism problem;
– use augmented factors.
Experiments on ACL-WMT corpora show that the proposed metrics yield higher correlation with human judgments. The proposed metrics have been published at top international conferences, e.g. COLING and MT SUMMIT. Evaluation work is closely related to similarity measurement, so these metrics can be extended to other areas such as information retrieval, question answering, and search.
A brief introduction to some of his other research will also be given, such as Chinese named entity recognition, word segmentation, and multilingual treebanks, published in the Springer LNCS and LNAI series. Suggestions and comments are much appreciated, and opportunities for further cooperation would be welcome.
Body-Part Nouns and Whole-Part Relations in PortugueseJorge Baptista
In this paper, we target the extraction of whole-part relations involving human entities and body-part nouns in SYSTEM, a hybrid statistical and rule-based Natural Language Processing chain for Portuguese. A whole-part relation is a semantic relation between an entity and another entity of which it is perceived as a constituent part, or a set of which it is a member.
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...Lifeng (Aaron) Han
The document presents a method for unsupervised machine translation evaluation using universal phrase tags. It designs a mapping between phrase tags from different treebanks to 9 universal tags. An unsupervised metric called HPPR is introduced to measure similarity between the universal phrase sequences of the source and translated sentences. Experiments on French-English data show HPPR achieves promising correlations with human judgments without using reference translations.
Pptphrase tagset mapping for french and english treebanks and its application...Lifeng (Aaron) Han
This paper describes a universal phrase tagset mapping between the French Treebank and English Penn Treebank using 9 phrase categories. It then applies this mapping to an unsupervised machine translation evaluation method that calculates similarity between the source and target sentences without reference translations. The method extracts phrase tags from the source and target, maps them to universal tags, and measures n-gram precision, recall, and position difference as similarity metrics. Evaluation on French-English data shows promising correlation with human judgments, though there is still room for improvement. The tagset and methods could facilitate future multilingual research.
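The n-gram precision and recall mentioned above can be illustrated on two sequences of universal phrase tags. This is only a sketch of the general idea: it is not the paper's actual metric, the position-difference component is omitted, and the function names and tag labels are assumptions.

```python
from collections import Counter


def ngrams(seq, n):
    """All contiguous n-grams of seq, as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def ngram_precision_recall(source_tags, target_tags, n=2):
    """Clipped n-gram overlap between source and target tag sequences.

    Precision is measured against the target (translated) side and
    recall against the source side; no reference translation is used.
    """
    src = Counter(ngrams(source_tags, n))
    tgt = Counter(ngrams(target_tags, n))
    overlap = sum((src & tgt).values())
    precision = overlap / sum(tgt.values()) if tgt else 0.0
    recall = overlap / sum(src.values()) if src else 0.0
    return precision, recall


# Illustrative tag sequences after mapping both sides to a shared tagset.
source = ["NP", "VP", "NP", "PP"]
target = ["NP", "VP", "PP"]
precision, recall = ngram_precision_recall(source, target)
```

Here only the bigram (NP, VP) is shared, so precision is 1/2 and recall is 1/3. Mapping both languages to the same tagset first is what makes the two sequences directly comparable.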
This document summarizes a presentation about a sentiment analysis system developed for a large Korean telecommunications company. The system was designed to analyze customer feedback from call centers. It classified feedback into categories, identified trends over time, and detected complaints. The system used Korean linguistic analysis and sentiment classification. It showed the benefits of combining machine learning and rules-based approaches. However, challenges remained around data quality, lexicon development, and meeting customer expectations. Future work focused on improving the sentiment dictionary and developing a platform for ongoing natural language processing services.
The document discusses several key aspects of programming languages including:
1) There is amazing variety across languages with over 2300 published languages grouped into four main families: imperative, functional, logic, and object-oriented.
2) Programming languages are the subject of ongoing debates around their relative merits and definitions.
3) Languages are constantly evolving as new ideas are introduced and older languages develop new dialects.
4) Languages influence programming practices but programmers can also work against a language's favored style.
1) The document describes a method for cross-modal knowledge distillation from pretrained language models to improve end-to-end spoken language understanding systems.
2) A pretrained BERT model is fine-tuned on text transcripts then used as a teacher to distill knowledge into a student end-to-end speech SLU system using mean absolute error loss.
3) Experimental results found this simple distillation approach helped the student learn uncertainty from the teacher model and improved performance over a speech-only baseline, demonstrating cross-modal knowledge sharing is effective for spoken language tasks.
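The distillation objective in step 2 can be sketched as a plain mean absolute error between teacher and student outputs. This is a dependency-free illustration: the function name is hypothetical, and whether the loss is applied to logits or to other representations is not stated in the summary, so the vectors below are placeholders.

```python
def mae_distillation_loss(teacher_outputs, student_outputs):
    """Mean absolute error between teacher and student output vectors."""
    if len(teacher_outputs) != len(student_outputs):
        raise ValueError("teacher and student outputs must have the same length")
    if not teacher_outputs:
        raise ValueError("outputs must be non-empty")
    total = sum(abs(t - s) for t, s in zip(teacher_outputs, student_outputs))
    return total / len(teacher_outputs)


# Placeholder values: a text-based teacher and a speech-based student
# scoring the same three intent classes for one utterance.
teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]
loss = mae_distillation_loss(teacher, student)
```

Matching the teacher's full output vector rather than only its top label is what exposes the student to the teacher's uncertainty across classes.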
The document summarizes an academic thesis defense presentation on evaluating machine translation. It introduces the background of machine translation evaluation (MTE), existing MTE methods like BLEU, METEOR, WER, and their weaknesses. It then outlines the designed model for a new MTE metric called LEPOR, including designed factors like an enhanced length penalty and n-gram position difference penalty. The document concludes by discussing experiments, enhanced models, and applications in shared tasks to evaluate LEPOR's performance.
LEPOR: an augmented machine translation evaluation metric - Thesis PPT Lifeng (Aaron) Han
The document provides an overview of machine translation evaluation (MTE). It discusses existing MTE methods like BLEU, METEOR, WER, and their weaknesses. The author's thesis proposes a new metric called LEPOR that incorporates additional factors to address weaknesses. The additional factors include an enhanced length penalty, n-gram position difference penalty, and tunable parameters to handle cross-language performance differences. The thesis will experiment with LEPOR on various language pairs and shared tasks to evaluate its performance.
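A length penalty of the kind mentioned above can be illustrated with a symmetric exponential form that penalizes candidates both shorter and longer than the reference. The formula used here is an assumption for illustration and is not taken from the slides; the function name is hypothetical.

```python
import math


def length_penalty(candidate_len, reference_len):
    """Symmetric length penalty in (0, 1].

    Returns 1.0 when the lengths match and decays exponentially as the
    candidate becomes either shorter or longer than the reference.
    """
    if candidate_len <= 0 or reference_len <= 0:
        raise ValueError("lengths must be positive")
    c, r = candidate_len, reference_len
    if c == r:
        return 1.0
    if c < r:
        return math.exp(1 - r / c)
    return math.exp(1 - c / r)


same = length_penalty(10, 10)
too_short = length_penalty(5, 10)
too_long = length_penalty(20, 10)
```

Unlike a brevity-only penalty, this form also lowers the score of overly long output, so a candidate half as long and one twice as long as the reference receive the same penalty.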
Natural language processing for requirements engineering: ICSE 2021 Technical...alessio_ferrari
These are the slides for the technical briefing given at ICSE 2021 by Alessio Ferrari, Liping Zhao, and Waad Alhoshan.
It covers RE tasks to which NLP is applied, an overview of a recent systematic mapping study on the topic, and a hands-on tutorial on using transfer learning for requirements classification.
The Colab notebooks are available here:
https://colab.research.google.com/drive/158H-lEJE1pc-xHc1ISBAKGDHMt_eg4Gn?usp=sharing
https://colab.research.google.com/drive/1B_5ow3rvS0Qz1y-KyJtlMNnmgmx9w3kJ?usp=sharing
https://colab.research.google.com/drive/1Xrm0gNaa41YwlM5g2CRYYXcRvpbDnTRT?usp=sharing
How to expand your nlp solution to new languages using transfer learningLena Shakurova
Expanding NLP models to new languages typically involves annotating new data sets, which is costly in both time and resources. To reduce these costs, one can use cross-lingual embeddings, which enable knowledge transfer from languages with sufficient training data to low-resource languages. In this talk, you will hear about the challenges in learning cross-lingual embeddings for multilingual resume parsing.
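The idea behind cross-lingual embeddings can be shown with a toy lookup: once words from two languages live in one shared vector space, a word in the low-resource language can be matched to its closest counterpart by cosine similarity. The vectors below are invented for illustration, and the English and Dutch word pairs are only an example; real embeddings would be learned and have hundreds of dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query_vec, candidates):
    """Return the candidate word whose vector is most similar to query_vec."""
    return max(candidates, key=lambda word: cosine(query_vec, candidates[word]))


# Toy vectors in a shared space (hypothetical values).
english = {"skills": [0.9, 0.1, 0.0], "education": [0.1, 0.9, 0.1]}
dutch = {"vaardigheden": [0.8, 0.2, 0.1], "opleiding": [0.2, 0.8, 0.0]}

match_skills = nearest(english["skills"], dutch)
match_education = nearest(english["education"], dutch)
```

A classifier trained on English vectors can then, in principle, be applied to Dutch input without Dutch annotations, which is the cost saving the talk describes.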
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...Lifeng (Aaron) Han
Presentation PPT in MT SUMMIT 2013.
Language-independent Model for Machine Translation Evaluation with Reinforced Factors
International Association for Machine Translation, 2013
Authors: Aaron Li-Feng Han, Derek Wong, Lidia S. Chao, Yervant Ho, Yi Lu, Anson Xing, Samuel Zeng
Proceedings of the 14th biennial International Conference of Machine Translation Summit (MT Summit 2013). Nice, France. 2 - 6 September 2013. Open tool https://github.com/aaronlifenghan/aaron-project-hlepor (Machine Translation Archive)
Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Sp...Yuki Tomo
Presented "Learning Character-level Representations for Part-of-Speech Tagging" by Cícero Nogueira dos Santos and Bianca Zadrozny at the Deep Learning Study Group @ Komachi Lab on 12/22.
This document describes a survey conducted by experts on automatic knowledge acquisition for lexicography. It aims to create an inventory of different types of automatic knowledge acquisition currently used in lexicographic projects. The survey found that the knowledge most commonly acquired automatically includes lemma lists, frequency information, example sentences, and grammatical patterns. Knowledge is either directly integrated into published dictionaries or requires human review first. Respondents expressed a need for more work on semantic relations, definitions, and domain-specific knowledge acquisition. Automatic methods show promise but also limitations for lexicography.
The document discusses formal language theory and its applications in natural language processing (NLP). It covers two main goals in computational linguistics - theoretical interest in formally characterizing natural language and practical interest in using well-understood frameworks like finite state models to solve NLP problems. Finite state devices are widely used in NLP tasks due to their efficiency and ability to model linguistic phenomena like words through dictionaries and rules. While finite state models provide a useful approximation of language, natural languages pose challenges like ambiguity, long distance dependencies and non-regular features that require extensions to basic finite state models.
The document discusses the development of OpenWN-PT, a Brazilian Portuguese Wordnet. Key points:
- OpenWN-PT is being created as part of a joint project between CPDOC and EMAp to apply formal logical tools to Portuguese text.
- It is based on the Universal Wordnet (UWN) which projects WordNet concepts into over 200 languages using statistical methods. The UWN provides an initial automated version of a Portuguese Wordnet.
- The creators are working to improve the initial UWN-based Portuguese Wordnet by combining it with data from Princeton WordNet, UWN, MENTA, and EuroWordNet to generate a new OpenWN-PT file.
Error Detection and Feedback with OT-LFG for Computer-assisted Language LearningCITE
HU, Yuxiu (Harbin Institute of Technology Shenzhen Graduate School, China)
BODOMO, Adams (The University of Hong Kong)
http://citers2013.cite.hku.hk/en/paper_603.htm
Similar to Cross language alignments - challenges guidelines and gold sets (20)
This paper is the result of collaboration between two projects: Emocionário and eSPERTo.
Emocionário aims at organizing emotions in Portuguese and annotating them in corpora. eSPERTo is a paraphrasing system that uses the NooJ linguistic engine, grammars, and lexicons.
The aims of this collaboration were fivefold: (i) from Emocionário’s point of view, it would be very useful to have an emotion paraphraser to help identify more cases of emotions in its corpora; (ii) from eSPERTo’s point of view, adding emotion paraphrases would considerably enhance its paraphrasing power; (iii) applying the emotion classification to a hitherto unused application domain would be a good way to evaluate Emocionário’s capabilities and shortcomings; (iv) both projects would gain from learning more about real paraphrases of emotion in text; and finally, (v) an interesting question is how good the methodology employed to harvest emotion paraphrases from parallel text is.
This paper presents a comparative study of alignment pairs, either contrasting expressions or stylistic variants of the same expression in the European (EP) and the Brazilian (BP) varieties of Portuguese. The alignments were collected semi-automatically using the CLUE-Aligner tool, which records all pairs of paraphrastic units resulting from the alignment task in a database. The corpus used was a children’s literature book "Os Livros Que Devoraram o Meu Pai" (The Books that Devoured My Father) by the Portuguese author Afonso Cruz and the Brazilian adaptation of this book. The main goal of the work presented here is to gather equivalent phrasal expressions and different syntactic constructions, which convey the same meaning in EP and BP, and contribute to the optimisation of editorial processes that are compulsory in the adaptation of texts but are also suitable for any type of editorial process. This study provides a scientific basis for future work in the area of editing, proofreading and converting text to and from any variety of Portuguese from a computational point of view, namely to be used in a paraphrasing system with a variety adaptation functionality, even in the case of a literary text. We contemplate “challenging” cases, from a literary point of view, looking for alternatives that do not tamper with the rich imagery of the original version.
This study proposes a comparative analysis (linguistic, but also literary and cultural) of the Portuguese and Brazilian editions of a work of children's and young adult literature, Os Livros Que Devoraram o Meu Pai by the Portuguese author Afonso Cruz, which appears on suggested reading lists in the school curricula of both Portugal and Brazil. The specific goal is to present and discuss a selection of lexical units, locutions and phrasal structures with an adjectival function that alternate between the two varieties, that is, between the author's choices in the EP variety and the corresponding solutions adopted in the BP version. The chosen methodology centres on contrastive linguistic analysis carried out with the help of digital tools based on the eSPERTo project, using semi-automatic alignments produced with the CLUE-Aligner tool (REF). The corpus consists of the Portuguese and Brazilian editions of the work under study. The general goal of this work is to optimise the editorial processes necessarily involved in adapting texts, and to survey the main difficulties of that process. This implies, among other things, an awareness of the limits imposed by a literary text, such as the fine line between indispensable adaptation and excessive intervention. Building on the results obtained, we also intend to encourage research into linguistic resources for the purposes of editing, proofreading and teaching Portuguese as a native and/or foreign language, among other applications.
This document provides an introduction and welcome message from the local organizers of the 3rd annual enetCollect MC meeting being held in Lisbon, Portugal. The summary includes:
1) The organizers thank the speakers, chairs, members, volunteers, and sponsors for their contributions to the meeting.
2) They introduce the official host, Professor Isabel Trancoso, and provide details on her extensive experience and leadership roles in spoken language processing.
3) The organizers conclude by thanking everyone for their participation in the meeting in Lisbon.
This document discusses using syntactic-semantic analysis for information extraction in biomedicine. It aims to extract biomolecular events like phosphorylation from text. It uses dictionaries of entities and verbs associated with event types, and NooJ grammars to identify events. Evaluation on a shared task dataset shows average recall of 36.76% and precision of 65.58% for six event types. While results are promising, it discusses limitations like manual pattern identification and challenges with more complex event constructions.
This presentation addresses the problem of translating support verb constructions (SVC), such as fazer uma operação (to make an operation). In particular, it focuses on the MT of biomedical-related SVC. It argues that paraphrasing can help translate these multiword expressions (MWE) with higher quality. This work is based on my PhD research, which addressed the problem of paraphrasing and translating SVC in general.
ReWriter uses linguistically based automated paraphrasing and text-editing mechanisms to help users with their writing needs by providing suggestions for customized text authoring. It also generates word and phrasal usage data to help guide decision-making. ReWriter can be used in word processing applications or linguistic quality control for both source and target texts, and it is a useful pre-editor for machine translation. The linguistic resources behind ReWriter, the paraphrasing grammars, and the tools from which ReWriter was derived will also be described. In this particular case, we illustrate ReWriter as a tool to process legal language.
Poster presented at the 2nd meeting of the COST Action CA16105 - enetCollect : European Network for Combining Language Learning with Crowdsourcing Techniques, which took place at Alexandru Ioan Cuza University, in Iasi, Romania.
The poster shows how chatbots can play an important role in Language Learning applications.
This paper presents the automation process of paraphrasing and converting Portuguese constructions typical of informal or spoken language into formal written language. We illustrate this automation process with examples extracted from the e-PACT corpus that involve the placement of clitic pronouns in verbal compound contexts. Our task consists in paraphrasing and normalizing, among others, constructions such as "vou-lhe/posso-lhe fazer uma surpresa" into "vou/posso fazer-lhe uma surpresa" (lit.: 'I will/can_to him/her make a surprise' / 'I will/can make_to him/her a surprise'; 'I will/can make him/her a surprise'), where the clitic pronoun "lhe" migrates from an enclitic position after the first verb of the verbal compound to an enclitic position after the main verb, which is the verb responsible for the selection of that pronominal argument. The first verb is either an auxiliary verb or a volitive verb, e.g. "querer" 'want'. This is a standard revision procedure in EP. Cases like this represent linguistic phenomena where language students and language users in general get confused or stumble. The paper focuses on general language where the phenomena being observed occur, describes examples of interest found in the corpus, and presents an automatic solution for the normalization of informal syntactic inadequacies found in the researched structures into standard formal writing structures through the application of very generic transformational grammars.
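The clitic migration described above can be approximated with a regular expression. This is a deliberately crude stand-in, not the paper's method: the paper applies transformational grammars, whereas the sketch below uses a hypothetical two-verb list and does not check that the word following the clitic is actually the main verb.

```python
import re

# Hypothetical, very small list of first verbs; a real grammar would cover
# auxiliary and volitive verbs in all their inflected forms.
FIRST_VERBS = ("vou", "posso")

PATTERN = re.compile(
    r"\b(" + "|".join(FIRST_VERBS) + r")-lhe\s+(\w+)",
    re.IGNORECASE,
)


def move_clitic(sentence):
    """Move 'lhe' from the first verb to the word that follows it."""
    return PATTERN.sub(r"\1 \2-lhe", sentence)


formal = move_clitic("vou-lhe fazer uma surpresa")
```

A surface pattern like this would also rewrite cases where the following word is not an infinitive, which is one reason a grammar with access to part-of-speech information is the more robust choice.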
This paper presents the alignment of verbal predicate constructions with the clitic pronoun "lhe" in the European (EP) and Brazilian (BP) varieties of Portuguese, such as in the sentences "Já lhe arrumaram a bagagem" | "Sua bagagem está seguramente guardada" 'His baggage is safely stowed away', where the EP dative proclisis "lhe" contrasts with the BP possessive pronoun "sua". We have selected several different paraphrastic contrasts, such as proclisis and enclisis, clitic pronouns co-occurring with relative pronouns and negation-type adverbs, among other constructions to illustrate the linguistic phenomenon. Some differences correspond to real contrasts between the two Portuguese varieties, while others purely represent stylistic choices. The contrasting variants were manually aligned in order to constitute a gold standard dataset, and a typology has been established to be further enlarged and made publicly available. The paraphrastic alignments were performed in the e-PACT corpus using the CLUE-Aligner tool. The research work was developed in the framework of the eSPERTo project.
This paper performs a detailed analysis on the alignment of Portuguese contractions, based on a previously aligned bilingual corpus. The alignment task was performed manually in a subset of the English-Portuguese CLUE4Translation Alignment Collection. The initial parallel corpus was pre-processed and a decision was made as to whether the contraction should be maintained or decomposed in the alignment. Decomposition was required in the cases in which the two words that have been concatenated, i.e., the preposition and the determiner or pronoun, go in two separate translation alignment pairs (PT- [no seio de] [a União Europeia] EN- [within] [the European Union]). Most contractions required decomposition in contexts where they are positioned at the end of a multiword unit. On the other hand, contractions tend to be maintained when they occur at the beginning or in the middle of the multiword unit, i.e., in the frozen part of the multiword (PT- [no que diz respeito a] EN- [with regard to] or PT- [além disso] EN- [in addition]). A correct alignment of multiwords and phrasal units containing contractions is instrumental for machine translation, paraphrasing, and variety adaptation.
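The maintain-or-decompose decision can be pictured as a selective expansion step. In the sketch below the choice of which contractions to split is passed in explicitly, since in the paper that decision was made manually; the function name and the small contraction table are assumptions and cover only a few common forms.

```python
# Small illustrative table: contraction -> (preposition, determiner/pronoun).
CONTRACTIONS = {
    "no": ("em", "o"), "na": ("em", "a"),
    "do": ("de", "o"), "da": ("de", "a"),
    "ao": ("a", "o"),
    "pelo": ("por", "o"), "pela": ("por", "a"),
    "disso": ("de", "isso"),
}


def decompose(tokens, split_at):
    """Expand the contractions found at the token indices in split_at.

    Tokens at other positions are kept as they are, so a contraction in
    the frozen part of a multiword unit can be maintained.
    """
    out = []
    for i, tok in enumerate(tokens):
        parts = CONTRACTIONS.get(tok.lower()) if i in split_at else None
        out.extend(parts if parts else [tok])
    return out


tokens = ["no", "seio", "da", "União", "Europeia"]
# Keep "no" (start of the multiword unit) and split "da" (its end).
result = decompose(tokens, split_at={2})
```

After this step "de" can be grouped with "no seio" and "a" with "União Europeia", which matches the two alignment pairs in the example above.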
The document describes the eSPERTo system, which generates paraphrases for text editing and revision. The main goal of the project is to develop a system able to identify and generate paraphrases to improve comprehension, simplify language and support the learning of Portuguese. The system can be useful in settings such as education, journalism and translation.
ReEscreve (in English, ReWriter) is a multi-purpose paraphraser that uses grammar-based paraphrasing capabilities suitable for source and target control (pre- and post-editing) and is useful for human and machine translation.
Spoken Language Systems Lab @ INESC-ID poster presented at the 1st meeting of the COST Action CA16105 - enetCollect : European Network for Combining Language Learning with Crowdsourcing Techniques, which took place at Eurac Research in Bolzano, Italy.
This presentation describes the integration of lexicon-grammar of predicate nouns with the support verb "fazer" ("to do" or "to make") into Port4NooJ, the Portuguese language module for NooJ. Port4NooJ resources are used by the eSPERTo system to generate paraphrases, i.e., alternative ways to say or write the same sentence.
Non-adjacent linguistic phenomena such as non-contiguous multiwords and other phrasal units containing insertions, i.e., words that are not part of the unit, are difficult to process and remain a problem for NLP applications. Non-contiguous multiword units are common across languages and constitute some of the most important challenges to high quality machine translation. This paper presents an empirical analysis of non-contiguous multiwords, and highlights our use of the Logos Model and the Semtab function to deploy semantic knowledge to align non-contiguous multiword units with the goal of translating these units with high fidelity. The phrase level manual alignments illustrated in the paper were produced with the CLUE-Aligner, a Cross-Language Unit Elicitation alignment tool.
This presentation describes the integration of paraphrases of human intransitive adjectives (of disease, membership, nationality and generic human adjectives) in the eSPERTo paraphrasing system, a linguistically enhanced paraphrase generator that enables conversion of semantically equivalent phrases and sentences based on semantico-syntactic patterns and multiword units, sensitive to context. eSPERTo is meant to be a hybrid system, combining statistics and linguistic knowledge to identify and generate new and more complex paraphrases and exploit existing paraphrasing resources. This system is integrated in an interactive application that helps users produce and revise their texts. Among other functionalities, eSPERTo’s web platform includes text-editing mechanisms that provide a variety of alternatives for each expression.
We used the Portuguese linguistic resources of Port4NooJ (the Portuguese module) enhanced with the distributional properties of the human intransitive adjectives described in Lexicon-Grammar tables and applied to grammars to generate paraphrases, invoking NooJ's linguistic engine (noojapply). The newly integrated properties made it possible to generate several new transformations, namely: (i) related adjective, noun and verb constructions; (ii) adjective constructions supported by different copulative verbs; (iii) constructions involving nationality and other membership relations; (iv) cross-constructions; (v) appropriate noun constructions; (vi) generic noun phrases.
Cross language alignments - challenges guidelines and gold sets
1. Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
technology
from seed
L2 F - Spoken Language Systems Laboratory 1
Cross-Language Alignments:
Challenges, Guidelines and Gold Sets
Anabela Barreiro Luísa Coheur Tiago Luís
Ângela Costa Fernando Batista João Graça
Outline – Part 1
• Word alignment
• Basic concepts
• Applications
• State of the art
• Limitations
• Paraphrase alignment
• Multiword, meaning and translation unit alignment: importance
• Our task
• Alignment tool: CLUE-Aligner
Outline – Part 2
• General annotation guidelines
• Cross-linguistic major challenges to word alignment
• Annotation guidelines for multiword units and lexical and non-lexical
realization phenomena
• Pro-dropping
• Articles and zero articles
• Examples: continuous multiword units
• Examples: continuous and discontinuous support verb constructions
• Preposition-dependency (V, N and Adj)
• Active vs passive
• Choice of noun pre-modifiers
• Different PoS with same semantics (V vs process N)
• Noun adjuncts
• Coordination
• Anaphora: choice of co-referents
• Impersonal constructions
• Contractions
• Style
• Antonyms and negation constructions
• Romance languages double negation
• Singular vs plural
• Idiomatic vs non-idiomatic
• Flexible/loose paraphrasing constructions
• Idiosyncrasies of each language
Outline – Part 3
• Our contribution
• Annotation process
• Preliminary results
• Discussion
• Future work
Word Alignment: Basic Concepts
• Objects representing the mapping of words (or expressions)
which are semantically equivalent in a source and a target
sentence of a parallel corpus [Brown et al., 1990]
– Matrix with one entry per pair (n, m), where n is a position in the source
sentence and m is a position in the target sentence. An entry a_{n,m} in that
matrix specifies whether the word at source position n is part of a
translation of the word at target position m
• Task of word alignment - identifying translational equivalences
(= semantic correspondences) in the aligned sentence pairs of
a parallel text [Hearne & Way, 2011]
• Translational equivalences - graphically represented in a grid
by the intersection of single segments (individual words) or
blocks (semantico-syntactic units, phrases, expressions)
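The matrix view above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `alignment_matrix` and `block_links`, the 0-based positions, and the choice of links for the example phrase are hypothetical and do not come from the slides or from CLUE-Aligner.

```python
from typing import List, Set, Tuple

# A link is (source position n, target position m); 0-based here by assumption.
Link = Tuple[int, int]


def alignment_matrix(n_source: int, n_target: int, links: Set[Link]) -> List[List[bool]]:
    """Build a boolean grid where cell [n][m] is True if source word n aligns with target word m."""
    matrix = [[False] * n_target for _ in range(n_source)]
    for n, m in links:
        matrix[n][m] = True
    return matrix


def block_links(source_span: range, target_span: range) -> Set[Link]:
    """A block alignment links every word of a source span to every word of a target span."""
    return {(n, m) for n in source_span for m in target_span}


source = ["the", "European", "Union"]
target = ["a", "União", "Europeia"]

# Segment (single-word) link: the <-> a
segment_links = {(0, 0)}
# Block link: [European Union] <-> [União Europeia]
block = block_links(range(1, 3), range(1, 3))

matrix = alignment_matrix(len(source), len(target), segment_links | block)
for word, row in zip(source, matrix):
    print(f"{word:10s}", ["X" if cell else "." for cell in row])
```

Representing a block as the full rectangle of cells is one common convention; a tool could equally store the two spans directly.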
Word Alignment: Basic Concepts
• Sure alignment (S-alignment)
– Unambiguous and valid in all contexts
• EN system
• ES sistema
• FR système
• PT sistema
• Possible alignment (P-alignment)
– Ambiguous and invalid in some contexts
• EN be
• ES ser/estar/haber/existir
• FR être/avoir/exister
• PT ser/estar/haver/existir
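The sure/possible distinction matters in practice because evaluation metrics treat the two link types differently. The sketch below computes the Alignment Error Rate (AER) as it is commonly defined in the word-alignment literature, AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|), with S a subset of P. The slides do not state that this metric was used; the function name and the toy link sets are assumptions for illustration.

```python
from typing import Set, Tuple

Link = Tuple[int, int]  # (source position, target position)


def alignment_error_rate(predicted: Set[Link], sure: Set[Link], possible: Set[Link]) -> float:
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|), where sure links also count as possible."""
    possible = possible | sure  # enforce S as a subset of P
    denominator = len(predicted) + len(sure)
    if denominator == 0:
        return 0.0  # nothing to predict and nothing predicted
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / denominator


# Toy gold standard: one sure link and two possible links (hypothetical positions).
sure = {(0, 0)}
possible = {(1, 1), (1, 2)}

print(alignment_error_rate({(0, 0), (1, 1)}, sure, possible))  # every predicted link is licensed
print(alignment_error_rate({(0, 0), (2, 2)}, sure, possible))  # one predicted link is not in P
```

A predicted link that is only "possible" is not penalised, while a missed "sure" link is, which is why annotators need clear criteria for choosing between S and P.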
Word Alignment: Applications
• Statistical machine translation
– [Brown et al., 1990] – statistical machine translation
– [Och and Ney, 2004] – phrase-based machine translation
– [Galley et al., 2004] – syntax-based machine translation
• Annotation projection
• Extraction of bilingual lexica
• Evaluation of machine translation systems
Word Alignment: State of the Art
• Workshops and evaluation tasks (multi-language)
– http://www.cse.unt.edu/~rada/wp/
– http://www.statmt.org/wpt05
– http://www.lpl.univ-aix.fr/projects/arcade
• Projects
– Blinker project – French-English
http://nlp.cs.nyu.edu/blinker/
• Guidelines
[Melamed, 1998] [Och and Ney, 2000]
[Lambert et al., 2005] [Kruijff-Korbayová et al., 2006]
[Graça et al., 2004]
Word Alignment: Limitations
• Language does not operate on a word-for-word basis
• A large number of words cannot be dissociated from the units in which they occur
– Multiword units (MWU)
• [Gross and Senellart, 1998] – more than 40% of one year of Le Monde text consists of MWU
• [Sag et al., 2002] – 50-70% of specialized lexica are MWU
• [Ramisch et al., 2010] – 56.7% of terms in the Genia corpus have 2+
words (not including general-purpose MWU, e.g., generic compounds,
lexical bundles, phrasal verbs, fixed expressions, which also occur in
domain-specific texts)
– Translation units
– Meaning units
– Paraphrases
• Segment and block alignment (sure and possible)
Example: Segment and Block
Alignment (Sure and Possible)
Paraphrase Alignment
• Monolingual
– [Callison-Burch et al., 2006]
• Annotation guidelines for paraphrase alignment
• Paraphrases - sentences that convey the same meaning but are
worded differently
• Alignment of words, phrases, expressions, within the same language
• Bilingual = (non-literal) translation
– Need to account for paraphrases across languages
Multiword, Meaning and Translation
Unit Alignment: Importance
• Publicly available manual word alignments are restricted
to a few language pairs
• Manual word alignments are a desired resource
– Evaluation of word alignment algorithms
– Training of supervised and semi-supervised algorithms
– Tuning of parameters for different types of model
• But, “name”, “concept” and “techniques” of alignment need
to be linguistically sophisticated to be more useful and
help provide improved machine translation!
Our Task
• EuroParl corpus [Koehn, 2005]
• 6 gold alignment sets
– 400 alignments per set (400 x 6 = 2,400)
• Languages: English, French, Portuguese and Spanish
– Language pairs: [en-es], [en-fr], [en-pt], [es-fr], [pt-es], [pt-fr]
• Guidelines for multi-language manual word annotations
(with inter-annotator agreement)
• Linguistically-informed (and linguistically-motivated) cross-
language multiword unit and paraphrase alignment
(translation unit alignment)
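The count of six gold sets follows directly from taking every unordered pair of the four languages. A minimal sketch of that arithmetic, using only the language codes and set size given on the slide (pair order here is alphabetical, so the slide's [pt-es] and [pt-fr] appear as ('es', 'pt') and ('fr', 'pt')):

```python
from itertools import combinations

languages = ["en", "es", "fr", "pt"]
alignments_per_set = 400  # figure taken from the slide

# Every unordered pair of distinct languages: C(4, 2) = 6
pairs = list(combinations(languages, 2))
total_alignments = alignments_per_set * len(pairs)

for source, target in pairs:
    print(f"[{source}-{target}]")
print(len(pairs), "pairs,", total_alignments, "alignments in total")
```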
CLUE-Aligner Alignment Tool
CLUE-Aligner =
Cross-Language Unit Elicitation Aligner
• Helps reduce ambiguity in the alignment process
• Facilitates the alignment of translation units
Major Challenges (4 different classes)
• semantico-discursive
– emphatic linguistic constructions
• tautology
• pleonasm and repetition
• focus constructions
• lexical and semantico-syntactic
– multiword units
– compound verbs
– prepositional predicates
Major Challenges (4 different classes)
• morphological
– contracted forms
– lexical versus non-lexical realization
• articles and zero articles
• pro-dropping
– subject pronoun drop
– empty relative pronoun
• morpho-syntactic
– free noun adjuncts
General Annotation Guidelines
Linguistic phenomenon No alignment P-alignment
Incomplete or non-translation X
Incorrect translation and typo X*
Approximate correspondence (numeric) X
Non-obligatory linguistic structure
Pleonasm X
Repetition of words or expressions X
Redundancy or additional/extra information X
Mismatching pronoun, determiner, verbs, etc. X
Abbreviations versus full word X
Punctuation mark
Different but correct X
Incorrect / mismatch X
Missing X
* If a multiword unit is incorrectly translated or contains a typo, none of its internal segments are aligned
Annotation Guidelines
Linguistic phenomenon No alignment Block-alignment (S-align, P-align)
Multiword unit
– continuous X X
– discontinuous X*
Lexical versus non-lexical realization
– article + N versus zero-article + N (EN Ø people = PT as pessoas) X
– pro-drop + V versus pronoun + V (EN I went = PT Ø fui) X
– empty relative pronoun versus realized relative pronoun (EN N that I met = N I met = PT que (eu) conheci) X
– relative versus participial adjective (EN that was written = written = PT escrito) X
* Some discontinuous multiword units are candidates for block-alignment (e.g., when the number of inserts is small or the multiword unit is “semi-frozen”)
Annotation Guidelines
Continuous multiword units Block-S-alignment Block-P-alignment
Support verb construction X X
Compound X X
Phrasal verb X X
Named entity X X
Date and time expression X
Lexical bundle X
Idiomatic expression X
Domain term X
French negation (ne pas) X
English infinitive (to + V) X X
[Barreiro, 2008] presents a detailed description and examples of the different types of multiword unit
Example: Continuous Support Verb
Constructions (alignment)
ES aprueba plenamente
FR approuve pleinement
Example: Discontinuous Support Verb
Constructions (no alignment)
ES para que acelere la directiva sobre pensiones complementarias
FR pour faire avancer la directive sur les pensions complémentaires
Cross-Linguistic Challenges
• Prepositional predicates
EN I too should like to congratulate [NE] on his excellent report
ES también yo quisiera felicitar a mi colega [NE] por su excelente informe
FR je voudrais féliciter moi aussi mon collègue [NE] pour son excellent
rapport
PT também eu gostaria de felicitar o meu colega [NE] pelo seu excelente
relatório
EN […] our Asian partners prefer to deal with questions which unite us
ES […] nuestros socios asiáticos prefieren dedicarse a las cuestiones que
nos unen
FR […] nos partenaires asiatiques préfèrent s’attacher à ce qui nous unit
PT […] os nossos parceiros asiáticos preferem centrar-se unicamente nas
questões comuns
Segment S-alignment
Impossible to annotate discontinuous preposition-dependency
Block P-alignment
Cross-Linguistic Challenges
• Prepositional verbs
agree with, belong to, forgive s/o for, pay for, stand for
aim at/for, choose between, hope for, prepare for, thank s/o for
allow for, comment on, insist on, prevent s/o from, think of/about
apologise for, compare with, interfere with/in, provide s/o with, volunteer to
apply for, complain about, joke about, refer to, wait for
approve of, concentrate on, laugh at, rely on, warn s/o about
argue with/about, congratulate on, lend s/th to s/o, run for, worry about
ask for, consist of, listen to, smile at
attend to, deal with, long for, succeed in
believe in, decide on, object to, suffer from
Cross-Linguistic Challenges
• Prepositional nouns
attack on, attitude towards, in agreement, on strike
cruelty towards, comparison between, on average, in trouble
difficulty in/with, decrease in, on condition, on behalf of
knowledge of, disadvantage of, delay in, connection between
reason for, increase in, in doubt, difference between/of
rise in, preference for, information about, under guarantee
solution to, reduction in, need for, in power
use of, at risk, protection from, reaction to
in a hurry, at stake, report on, result of
in practice, in theory, room for, trouble with
Cross-Linguistic Challenges
• Prepositional adjectives
delighted at/about, frightened of, opposed to, similar to
different from, friendly with, pleased with, sorry for/about
dissatisfied with, good at, popular with, suspicious of
doubtful about, guilty of, proud of, sympathetic to(wards)
enthusiastic about, incapable of, puzzled by/about, tired of
envious of, interested in, safe from, typical of
excited about, jealous of, satisfied with, unaware of
famous for, keen on, sensitive to(wards), used to
fed up with, kind to, serious about
fond of, mad at/about, sick of
Cross-Linguistic Challenges
• Noun Adjuncts
– Compounds
• European investment bank banco europeu de investimento
[Adj N N] [N Adj [de N]]
– Free noun phrases (not compounds)
• presidency communication comunicação da presidência
[N N] [N [de N]]
Block S-alignment
Segment S-alignment
Block-P-alignment
of [de N]
Cross-Linguistic Challenges
• Contractions
– two or more words with different parts-of-speech overlap, which
makes syntactic analysis and generation difficult
– in cross-language analysis, the contrast between languages that
have contractions and languages that do not have them, or do not
have them in the same contexts, presents additional difficulties
– The alignment of one segment that corresponds to a contracted form
in one language with the corresponding segments where elements
are not contracted in the other language of the parallel pair is
pragmatically motivated
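Decomposing a contraction before alignment can be sketched as a simple lookup. The table below covers only a handful of common Portuguese preposition + article contractions, and the function `decompose` and its `positions` parameter are hypothetical names introduced for illustration, not part of any tool described in the slides. The lookup is deliberately naive: forms such as "no" or "nos" can also be pronouns, so a real system would need part-of-speech context before splitting.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Partial table of Portuguese preposition + article contractions (illustrative, not exhaustive).
CONTRACTIONS: Dict[str, Tuple[str, str]] = {
    "no": ("em", "o"), "na": ("em", "a"), "nos": ("em", "os"), "nas": ("em", "as"),
    "do": ("de", "o"), "da": ("de", "a"), "dos": ("de", "os"), "das": ("de", "as"),
    "ao": ("a", "o"), "aos": ("a", "os"),
    "pelo": ("por", "o"), "pela": ("por", "a"), "pelos": ("por", "os"), "pelas": ("por", "as"),
}


def decompose(tokens: List[str], positions: Optional[Iterable[int]] = None) -> List[str]:
    """Split contractions into preposition + article.

    If `positions` is given, only tokens at those indices are split, which lets an
    annotator keep a contraction intact inside the frozen part of a multiword unit.
    """
    selected = None if positions is None else set(positions)
    result: List[str] = []
    for index, token in enumerate(tokens):
        parts = CONTRACTIONS.get(token.lower())
        if parts is not None and (selected is None or index in selected):
            result.extend(parts)
        else:
            result.append(token)
    return result


tokens = ["no", "seio", "da", "União", "Europeia"]
print(decompose(tokens))                 # split every known contraction
print(decompose(tokens, positions={2}))  # split only "da", keep "no" inside the multiword unit
```

Splitting only position 2 yields the tokens needed for two separate alignment units such as [no seio de] and [a União Europeia], where the preposition and the article fall on different sides of the boundary.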
Example: Contractions (block-P-alignment)
Interference with the support verb construction
EN to make a reference to
PT fazer uma referência a
Example: Contractions (block-P-alignment)
Interference with the support verb construction
ES hacer una referencia a
FR faire référence à
Cross-Linguistic Challenges
• Singular versus plural (related to determiner)
EN in every official language of the union
ES en todos los idiomas oficiales de la unión
FR dans toutes les langues officielles de l'union
PT em cada uma das línguas oficiais da união
• Active versus passive
EN before new member states are admitted
ES antes de la incorporación de nuevos miembros
FR avant l'admission de nouveaux membres
PT antes da entrada de novos membros
Block or segment
P-alignment
Block-S-alignment if there
is some fixedness
(such as in this case)
Block P-alignment
Cross-Linguistic Challenges
• Coordination
EN which we will send to the council and Ø parliament
ES que enviaremos al consejo y al parlamento
FR qui sera envoyée au conseil et au parlement
PT que remeterá ao conselho e ao parlamento
• Style: idiomatic versus non-idiomatic
EN which began four years ago
ES que empezó hace cuatro años
FR qui a vu le jour il y a quatre ans
PT que se iniciou há quatro anos
No alignment
Block P-alignment
32. Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
Cross-Linguistic Challenges
• Choice of noun pre-modifiers
EN we should use that public funding for those types of project which are
most difficult to finance through the private sector
ES deberíamos utilizar esa financiación pública para aquel tipo de proyectos
que tienen mayor dificultad para ser financiados por el sector privado
FR nous devrions recourir au financement public pour les projets que le
secteur privé boude
PT o financiamento público deveria ser utilizado para os projectos que
registam maiores dificuldades em serem financiados pelo sector privado
Block P-alignment
EN despite certain difficulties
PT apesar das dificuldades
33. Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
Cross-Linguistic Challenges
• Anaphora - choice of co-referents (noun versus pronoun)
EN it is not acceptable that we assisted Korea during the Asean crisis by
means of IMF loans and suchlike, only for Korea still to be subsidising its
shipyards
ES no resulta procedente que hayamos ayudado a Corea en la crisis de la
Asean a través de préstamos del FMI, etc. y que Corea siga
subvencionando sus astilleros
FR il n’est pas acceptable que nous ayons aidé la Corée dans la crise de
l’Anase, avec des prêts du FMI, etc. et qu’elle continue à subventionner
ses chantiers navals
PT é inadmissível que, depois de termos ajudado a Coreia, através de
créditos do FMI, etc., na crise da Asean, este país continue a
subvencionar agora os seus estaleiros navais
Segment or block P-alignment
34. Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
Cross-Linguistic Challenges
• Antonyms and negation constructions
EN the countries of Asia have not unfortunately been in favour of that
proposal
ES los países de Asia desgraciadamente no han sido favorables a dicha
propuesta
FR les pays d'Asie ont malheureusement rejeté cette proposition
PT os países da Ásia, infelizmente, não se mostraram favoráveis a esta
proposta
Block S-alignment together with adverb (insert in EN and FR)
35. Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
Cross-Linguistic Challenges
• Flexible/loose paraphrasing constructions
EN and we shall vote against it
ES y merece nuestra condena
FR et dénonçons
PT e merece a nossa condenação
EN 1993 was a significant year
ES el año 1993 es una fecha notable
FR l’année 1993 est à marquer d’une pierre blanche
PT 1993 é uma data charneira
Block P-alignment
36. Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
Cross-Linguistic Challenges
• Different parts-of-speech with same semantics (verbs versus
process nouns)
EN we must use all the financial instruments at our disposal to rapidly
develop the market
ES es preciso utilizar todos los instrumentos financieros disponibles para un
rápido desarrollo ulterior del mercado
FR il faut utiliser tous les instruments financiers disponibles pour
développer rapidement le marché
PT todos os instrumentos financeiros disponíveis deverão ser aplicados
para continuar a desenvolver rapidamente o mercado
Block S-alignment (with internal segment P-alignments)
EN and PT:
Segment S-alignment
No alignment of [continuar a]
37. Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
Cross-Linguistic Challenges
• Impersonal constructions
(+ “impersonal” relative versus participial adjective)
EN we must fully support the demands that have been made
ES hay que apoyar plenamente las exigencias que se han formulado
FR il faut par conséquent appuyer les requêtes formulées
PT as reivindicações formuladas deverão ser plenamente apoiadas
Block P-alignment
Internal P-alignment
EN we must
ES hay que
FR il faut
Internal segment S-alignment - adverb and verb (EN, ES, FR)
Internal segment P-alignment - verb (PT)
38. Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
Cross-Linguistic Challenges
• Romance languages double negation (+ coordination)
EN it is not, therefore, surprising that there is, in this context, no real
integration or genuine political dialogue
ES no es nada sorprendente, entonces, que en ese contexto, no haya ni
verdadera integración ni verdadero diálogo político
FR rien d’étonnant donc, qu'il n'y ait dans ce contexte, ni intégration
véritable, ni dialogue politique véritable
PT assim, não é de espantar que, nesse contexto, não exista verdadeira
integração nem verdadeiro diálogo político
Block P-alignment of the relative existential with adverbial (insert)
EN that there is, in this context, no
ES que en ese contexto, no haya
FR qu’il n’y ait dans ce contexte
PT que, nesse contexto, não exista
Segment P-alignment of negation and negation connector
EN no – or
ES ni – ni
FR n’ – ni
PT Ø – nem
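The block and segment annotations on this slide could be stored as typed links. The following is a minimal sketch, not the CLUE-Aligner format: the `Link` class, its field names, and the choice to split "no – or / ni – ni" into two one-to-one segment links are assumptions made for illustration. Punctuation tokens are left out of the block for brevity.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One alignment link between source and target tokens (hypothetical structure)."""
    source: tuple      # source-language tokens
    target: tuple      # target-language tokens
    level: str         # "segment" (word) or "block" (multiword unit)
    confidence: str    # "S" (sure) or "P" (possible)

    def __post_init__(self):
        if self.level not in {"segment", "block"}:
            raise ValueError(f"unknown level: {self.level}")
        if self.confidence not in {"S", "P"}:
            raise ValueError(f"unknown confidence: {self.confidence}")


# EN-ES links for the double-negation example above.
links = [
    # Block P-alignment of the relative existential with the adverbial insert.
    Link(("that", "there", "is", "in", "this", "context", "no"),
         ("que", "en", "ese", "contexto", "no", "haya"),
         "block", "P"),
    # Segment P-alignments of the negation and the negation connector.
    Link(("no",), ("ni",), "segment", "P"),
    Link(("or",), ("ni",), "segment", "P"),
]

# Tally links by (level, confidence), the same breakdown a results table would use.
counts = Counter((link.level, link.confidence) for link in links)
print(counts)
```

Keeping the level and the S/P label as separate fields makes it straightforward to count sure and possible links per alignment level.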
39. Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
Cross-Linguistic Challenges
• Idiosyncrasies of languages
• Portuguese inflected infinitive (peculiar verb form)
• English to+Infinitive
• French negation
• English apostrophe
• …
• Sociolinguistic differences
40. Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
Our Contribution
• Tool CLUE-Aligner
• Annotated corpora
• Cross-language resources – gold collection
Publicly available on the META-NET website:
http://metanet4u.l2f.inesc-id.pt/
• Guidelines
– http://www.inesc-id.pt/ficheiros/publicacoes/8204.pdf
41. Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
Annotation Process
• Annotation of 400 x 6 (2,400 sentence alignments) by a linguist
• Alignment of a subset by a second linguist (25 sentences of the
English-Portuguese language pair)
• Inter-annotator agreement
42. Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
Preliminary Results

language   words   avg. words
en         11158   27.9
es         11664   29.2
fr         12464   31.2
pt         11649   29.1

pair    Sure   Possible   Total
en-pt   6684   418        7102
en-fr   7025   569        7594
en-es   7636   399        8035
es-fr   7477   767        8244
pt-es   7958   557        8515
pt-fr   7029   782        7811

pair    Sure   Possible   Total
en-pt   2588   602        3190
en-fr   3865   414        4279
en-es   3551   351        3902
es-fr   3516   495        4011
pt-es   3162   382        3544
pt-fr   3253   698        3951

Block (MWU) alignment
Segment (word) alignment
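The figures above can be cross-checked with a few lines of arithmetic. This is a small sketch that assumes 400 sentences per language, as stated on the Annotation Process slide. The names `table_1` and `table_2` are placeholders that follow the order in which the tables appear, since the transcript does not show which label belongs to which table.

```python
SENTENCES = 400  # sentences per language, from the Annotation Process slide

words = {"en": 11158, "es": 11664, "fr": 12464, "pt": 11649}

# (Sure, Possible) counts per language pair, in slide order.
table_1 = {
    "en-pt": (6684, 418), "en-fr": (7025, 569), "en-es": (7636, 399),
    "es-fr": (7477, 767), "pt-es": (7958, 557), "pt-fr": (7029, 782),
}
table_2 = {
    "en-pt": (2588, 602), "en-fr": (3865, 414), "en-es": (3551, 351),
    "es-fr": (3516, 495), "pt-es": (3162, 382), "pt-fr": (3253, 698),
}


def summarize(table):
    """Return the total and the share of Possible links for each pair."""
    return {
        pair: {"total": sure + possible,
               "possible_share": possible / (sure + possible)}
        for pair, (sure, possible) in table.items()
    }


# Average sentence length per language, rounded as on the slide.
avg_words = {lang: round(count / SENTENCES, 1) for lang, count in words.items()}

for name, table in (("table_1", table_1), ("table_2", table_2)):
    for pair, stats in summarize(table).items():
        print(name, pair, stats["total"], f"{stats['possible_share']:.1%}")
```

The share of Possible links is not reported on the slide; it is computed here only as a convenient way to compare language pairs.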
43. Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
Inter-annotator Agreement
• Statistical significance for kappa is rarely reported. However, a
number of magnitude guidelines have appeared in the literature.
– Landis & Koch (1977) consider
• kappas between .4 and .6 as a moderate agreement
• kappas between .8 and 1 correspond to an almost perfect agreement
– Fleiss (1981) (equally arbitrary guidelines) characterizes
• kappas from .40 to .75 as fair to good
• kappas over .75 as excellent
• This set of guidelines is however by no means universally accepted
                         Cohen's kappa coefficient
Multi-word units (MWU)   0.541
Word alignments (WA)     0.984
Total                    0.871
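For reference, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below uses toy labels, not the annotation data behind the figures above. The label set ("S" for sure, "P" for possible, "N" for no link) and the function name are assumptions for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy decisions for eight candidate links (illustrative only).
annotator_1 = ["S", "S", "P", "N", "S", "N", "P", "S"]
annotator_2 = ["S", "S", "P", "N", "P", "N", "S", "S"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # observed 0.75, expected 0.375, kappa 0.6
```

On the Landis & Koch scale quoted above, the toy value of 0.6 would sit at the upper edge of moderate agreement.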
44. Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
Discussion
• Difficulties in analyzing fluency, stylistics (including word order),
paraphrase, etc.
• Alignments do not always work bi-directionally (sometimes the source-
target direction for a language pair matters)
• Levels of alignment and ranking systems (n-grams, morphology,
semantico-syntactic level, phrase, paraphrase, etc.)
• Terminology imprecision is found in corpora (it leads to poor quality
machine translation)
45. Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
Future Work
• Integration of lexica (multiword units, etc.) obtained via the use of local
grammars – use multiword units as ONE (1) segment of alignment,
whenever that is possible (contiguous, etc.)
• Pre-processing of contractions and post-processing of elements that
need to be contracted is important if applied to machine translation or
to create “more polished” lexica
• Evaluation of the current alignments in a statistical machine translation
system to see if translation quality improves
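The contraction pre-processing and post-processing step mentioned above might look like the sketch below for Portuguese. The lookup table is a tiny illustrative subset chosen for this example and is not taken from the authors' resources. A real system would also need to disambiguate forms such as "a" or "no" in context, and this sketch lowercases contracted tokens when expanding them.

```python
# Illustrative subset of Portuguese preposition + article contractions.
CONTRACTIONS = {
    "do": ("de", "o"), "da": ("de", "a"), "dos": ("de", "os"), "das": ("de", "as"),
    "no": ("em", "o"), "na": ("em", "a"), "ao": ("a", "o"), "pelo": ("por", "o"),
}
# Reverse lookup used to restore contractions after alignment or translation.
EXPANSIONS = {pair: form for form, pair in CONTRACTIONS.items()}


def expand_contractions(tokens):
    """Pre-processing: split each known contraction into its two elements."""
    expanded = []
    for token in tokens:
        expanded.extend(CONTRACTIONS.get(token.lower(), (token,)))
    return expanded


def contract(tokens):
    """Post-processing: merge adjacent elements that must be contracted."""
    result = []
    i = 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if len(pair) == 2 and pair in EXPANSIONS:
            result.append(EXPANSIONS[pair])
            i += 2
        else:
            result.append(tokens[i])
            i += 1
    return result


sentence = "apesar das dificuldades".split()
expanded = expand_contractions(sentence)
print(expanded)            # ['apesar', 'de', 'as', 'dificuldades']
print(contract(expanded))  # ['apesar', 'das', 'dificuldades']
```

With the contraction split, "de" and "as" can each receive their own segment alignment instead of forcing a block alignment on "das".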
46. Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
Future Work
• Machine learning of recognition and alignment of multiword units
• based on segment alignments, i.e., individual words inside the
multiword unit
• based on multiword units of a parallel sentence in another language or
language pair alignment
• Use of local grammars that identify and process discontinuous
multiword units and other complex linguistic phenomena to combine
with word alignment techniques – how to combine?
47. Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
Main Conclusion
• Bringing linguistics into SMT at the start is the first, inevitable place
where hybridization should be possible.
• We believe that it would be productive to convert the texts on both sides of
a translation pair into a common semantico-syntactic
representation before applying statistics to them. For this, each
language would need a parser capable of producing
homogeneous output.
• If this common representation were available, it would open up vast
possibilities for multilingual SMT.
48. Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
Thank you!
49. Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
Cross-Language Alignments:
Challenges, Guidelines and Gold Sets
Anabela Barreiro Luísa Coheur Tiago Luís
Ângela Costa Fernando Batista João Graça
Editor's Notes
Before starting this presentation, I would like to thank Priberam for the opportunity to show our work at this seminar. And now, I will proceed in English… Good morning. My name is Anabela Barreiro. I am an invited researcher at the Spoken Language Systems Laboratory, at INESC-ID, Lisbon. Today, I will present “Cross-Language Alignments: Challenges, Guidelines and Gold Sets”, done in collaboration with my colleagues Luísa Coheur, Tiago Luís, Ângela Costa, Fernando Batista and João Graça. In this presentation, I will describe the key cross-language annotation guidelines to provide support for machine translation systems. The guidelines aim at improving the quality of the machine translation output by using linguistically-informed and motivated annotation of special case multiwords and semantico-syntactic translation units.
This presentation is divided into 3 parts. I will describe CLUE-Aligner, a tool developed to reduce ambiguity in the alignment process and facilitate the alignment of meaning and translation units.
I will focus on the challenges to the alignment of special cross-linguistic cases, such as multiword units, lexical and non-lexical realization (the pro-drop phenomenon, determiners and zero determiners), noun adjuncts, and idiosyncrasies of each language.
The main use of word alignments is SMT. [Brown et al., 1990] introduced the concept of word alignment and applied it directly to an SMT system. [Och and Ney, 2004] used it as a primary resource for phrase-based machine translation. [Galley et al., 2004] used it as a resource for syntax-based machine translation.
In recent years, with the increase of freely available parallel corpora, a huge development took place in SMT. Many workshops and evaluation tasks have been dedicated to multi-language word alignment. Some projects too. For example, the Blinker project aimed at aligning words between French and English texts. Many word alignment guidelines have been suggested.
Despite the growing number of available multi-language sentence-aligned parallel corpora and alignment tools, the number of publicly available manual word alignments is restricted to a few language pairs. Word alignment is a desirable resource.
The guidelines were based on the alignment of bilingual texts of the common test set of the publicly available Europarl corpus, which contains proceedings of the European Parliament in the different official languages of the EU. The work provides 6 gold alignment sets. The bilingual texts cover all possible combinations between the English, Spanish, French, and Portuguese languages.
CLUE-Aligner, a tool developed to reduce ambiguity in the alignment process and facilitate the alignment of meaning and translation units.
We use block alignments only when one of these elements is elided. When the elements are lexically realized, determiners, pronouns and the individual elements of the relatives are single aligned with the corresponding elements in the parallel sentence. Exceptions: Discontinuous multiword units with a small number of inserts are aligned
Other examples of aligned and not aligned MWU. Phrasal verbs: Aligned – look into the problem – debruçar-se sobre este problema; Not aligned – (230). Verb compounds: Aligned – has also increased (22); Not aligned - French negation: Aligned – ne pas; Not aligned -
(da presidência is S-aligned with presidency) Presidency communication is in the corpus – but it does not sound right!
NOT A GOOD SOLUTION – it does not account for the double negation structure
The gold collection and alignment tool are publicly available.