This presentation describes the integration of paraphrases of human intransitive adjectives (adjectives of disease, membership, nationality, and generic human adjectives) into the eSPERTo paraphrasing system, a linguistically enhanced paraphrase generator that converts semantically equivalent phrases and sentences based on context-sensitive semantico-syntactic patterns and multiword units. eSPERTo is designed as a hybrid system, combining statistics and linguistic knowledge to identify and generate new and more complex paraphrases and to exploit existing paraphrasing resources. The system is integrated in an interactive application that helps users produce and revise their texts. Among other functionalities, eSPERTo's web platform includes text-editing mechanisms that offer a variety of alternatives for each expression.
We used the Portuguese linguistic resources of Port4NooJ (the Portuguese module), enhanced with the distributional properties of the human intransitive adjectives described in Lexicon-Grammar tables and applied in paraphrase-generation grammars, invoking NooJ's linguistic engine (noojapply). The newly integrated properties made it possible to generate several new transformations, namely: (i) relating adjective, noun, and verb constructions; (ii) adjective constructions supported by different copulative verbs; (iii) constructions involving nationality and other membership relations; (iv) cross-constructions; (v) appropriate noun constructions; and (vi) generic noun phrases.
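As an illustration of the kind of grammar-driven transformation involved, the toy Python sketch below rewrites a disease-adjective construction as its appropriate-noun paraphrase. The lexicon entries, the rule, and the function name are invented for illustration only; eSPERTo's actual grammars run on the NooJ engine and are far richer.

```python
# Toy sketch of one pattern-based paraphrase rule for disease adjectives.
# Lexicon and rule are invented examples, not eSPERTo's real resources.
import re

# Hypothetical mapping from a disease adjective to its appropriate noun.
ADJ_TO_NOUN = {
    "asmático": "asma",
    "diabético": "diabetes",
}

def paraphrase_disease_adj(sentence: str) -> list[str]:
    """Rewrite '<Subj> é Adj' as '<Subj> sofre de N' when Adj is a known
    disease adjective, e.g. 'O Pedro é asmático' -> 'O Pedro sofre de asma'."""
    paraphrases = []
    for adj, noun in ADJ_TO_NOUN.items():
        m = re.match(rf"(.+?) é {adj}$", sentence)
        if m:
            paraphrases.append(f"{m.group(1)} sofre de {noun}")
    return paraphrases

print(paraphrase_disease_adj("O Pedro é asmático"))
```

A real grammar would also license the other copulative verbs and the verbal constructions listed above for the same lexical entry.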
Emily Pitler - Representations from Natural Language Data: Successes and Challenges (MLconf)
This document discusses successes and challenges in natural language processing. It summarizes recent advances in language representation models including BERT and Transformers. However, it notes that models can struggle with realistic "in-the-wild" inputs that differ from their training data, such as code-mixed text with multiple languages or identifying verbal commands. The document advocates for further work addressing mismatches between training and real-world inputs to improve the robustness of NLP models.
Non-adjacent linguistic phenomena such as non-contiguous multiwords and other phrasal units containing insertions, i.e., words that are not part of the unit, are difficult to process and remain a problem for NLP applications. Non-contiguous multiword units are common across languages and constitute some of the most important challenges to high-quality machine translation. This paper presents an empirical analysis of non-contiguous multiwords and highlights our use of the Logos Model and the Semtab function to deploy semantic knowledge to align non-contiguous multiword units, with the goal of translating these units with high fidelity. The phrase-level manual alignments illustrated in the paper were produced with CLUE-Aligner, a Cross-Language Unit Elicitation alignment tool.
The document describes OpenLogos, an open-source machine translation system that uses knowledge-rich bilingual dictionaries. These dictionaries contain extensive semantic and syntactic information for entries, expressed in the Semantico-Syntactic Abstraction Language (SAL). Three dictionaries from English into other languages, each containing over 80,000 entries, were created for research purposes. The goal is to make the lexical resources freely available to help develop new NLP tools, especially for under-resourced languages.
The document discusses the contribution of language technologies to the globalization of Portuguese. It describes projects of the Spoken Language Systems Laboratory (Laboratório de Sistemas de Língua Falada, L2F) in areas such as multimedia transcription, distance learning, telehealth, and machine translation, together with the challenges associated with each.
1) The document discusses a linguistic evaluation of support verb constructions performed on the OpenLogos and Google Translate machine translation systems.
2) A corpus of 100 sentences containing support verb constructions was translated into several languages by each system and evaluated both quantitatively and qualitatively.
3) The evaluation found that OpenLogos translated more support verb constructions correctly thanks to its use of linguistic rules and representations, while Google Translate struggled more with non-contiguous and idiomatic constructions due to its statistical nature.
CLUE-Aligner is an interactive tool for annotating pairs of paraphrastic and translated linguistic units. It allows the alignment of contiguous and discontiguous multiword units through a matrix visualization. Alignments are classified as "sure" or "possible" based on criteria of optimal or approximate semantic equivalence. The tool was inspired by previous alignment applications but addresses current shortcomings like a lack of support for discontiguous multiwords. Future work includes using aligned units to train machine learning models for paraphrasing and machine translation applications.
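The alignments such a tool records can be represented minimally as token-index sets on each side, which naturally accommodates discontiguous units, plus a "sure"/"possible" label. The sentence pair and indices below are invented examples, not CLUE-Aligner's actual storage format.

```python
# Minimal internal representation for one phrase-level alignment:
# index sets on each side plus a confidence label. Data is illustrative.
src = ["He", "gave", "the", "plan", "up"]   # "gave ... up" with an insertion
tgt = ["Ele", "abandonou", "o", "plano"]

# The discontiguous unit {gave, up} aligns to the single verb {abandonou}.
alignment = {"src": {1, 4}, "tgt": {1}, "label": "sure"}

def aligned_tokens(tokens, indices):
    """Recover the (possibly discontiguous) unit from its index set."""
    return [tokens[i] for i in sorted(indices)]

print(aligned_tokens(src, alignment["src"]))  # ['gave', 'up']
print(aligned_tokens(tgt, alignment["tgt"]))  # ['abandonou']
```

Index sets, unlike contiguous spans, let a single record capture insertions inside the unit, which is exactly the shortcoming of earlier aligners noted above.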
This document briefly describes how virtual agents work and the process of implementing a conversational bot. It explains that the bot analyzes the customer's message, searches for keywords, and performs predefined actions such as querying a CRM. It then assembles a templated response and sends it to the customer. It also discusses integrating the bot with channels such as chat, SMS, and social networks.
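The keyword-driven flow described above can be sketched in a few lines; the CRM lookup, intent keywords, and response templates here are all invented placeholders, not any real bot framework's API.

```python
# Sketch of a keyword-driven conversational bot: scan the message for
# keywords, run a predefined action (a fake CRM query), fill a template.
def crm_lookup(customer_id: str) -> dict:
    # Stand-in for a real CRM query; returns fabricated record fields.
    return {"id": customer_id, "status": "active", "balance": "120.50"}

# Hypothetical intent keywords mapped to response templates.
INTENTS = {
    "balance": "Hello! The balance on account {id} is {balance}.",
    "status": "Account {id} is currently {status}.",
}

def reply(message: str, customer_id: str) -> str:
    record = crm_lookup(customer_id)
    for keyword, template in INTENTS.items():
        if keyword in message.lower():
            return template.format(**record)
    return "Sorry, I did not understand. An agent will contact you."

print(reply("What is my balance?", "C-42"))
```

Channel integration (chat, SMS, social networks) would wrap this same reply function behind each channel's messaging interface.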
This paper describes ReEscreve, a multi-purpose paraphraser with grammar-based paraphrasing capabilities suitable for source and target control (pre- and post-editing) and useful for both human and machine translation. At the current stage, ReEscreve transforms support verb constructions into verbs or similar expressions with 93.4% precision, and it is progressively being extended to paraphrase other linguistic phenomena, enabling its use as an authoring and stylistic aid in word-processing applications. ReEscreve is freely available on the Internet at: http://www.linguateca.pt/Reescreve/
This paper reports our first attempt at integrating eSPERTo's paraphrastic engine, which is based on the NooJ platform, into two application scenarios: a conversational agent and a summarization system. We briefly describe eSPERTo's base resources and the modifications to these resources that enabled the production of the paraphrases required to feed both systems. Although the improvement observed in both scenarios is not significant, we present a detailed error analysis to further improve the results in future experiments.
This paper presents a methodology to extract a paraphrase database for the European and Brazilian varieties of Portuguese and discusses a set of paraphrastic categories of multiwords and phrasal units, such as the compounds toda a gente vs. todo o mundo "everybody" or the gerundive constructions [estar a + V-Inf] vs. [ficar + V-Ger] (e.g., estive a observar vs. fiquei observando "I was observing"), which are extremely relevant to high-quality paraphrasing. The variants were manually aligned in the e-PACT corpus using the CLUE-Aligner tool. The methodology, inspired by the Logos Model, focuses on a semantico-syntactic analysis of each paraphrastic unit and constitutes a subset of the Gold-CLUE-Paraphrases. The construction of a larger dataset of paraphrastic contrasts among the distinct varieties of Portuguese is indispensable for variety adaptation, i.e., for dealing with the cultural, linguistic, and stylistic differences between them, making it possible to convert texts (semi-)automatically from one variety into another, a key function in paraphrasing systems. This topic represents an interesting new line of research with valuable applications in language learning, language generation, question answering, summarization, and machine translation, among others. The paraphrastic units are the first resource of their kind for Portuguese to be made available to the scientific community for research purposes.
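At its simplest, variety adaptation of this kind amounts to phrase-level substitution over a paraphrase database. The sketch below uses the two EP/BP pairs quoted above; a real system would consult the full database and handle inflection, context, and ambiguity.

```python
# Toy EP -> BP variety adapter built from the two pairs cited in the
# abstract; purely illustrative, ignores inflection and context.
EP_TO_BP = {
    "toda a gente": "todo o mundo",
    "estive a observar": "fiquei observando",
}

def ep_to_bp(text: str) -> str:
    """Replace each known European Portuguese unit with its Brazilian
    counterpart (naive longest-listed-first is not needed for this toy set)."""
    for ep, bp in EP_TO_BP.items():
        text = text.replace(ep, bp)
    return text

print(ep_to_bp("toda a gente sabe que estive a observar"))
```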
The document provides an introduction to natural language processing (NLP), discussing key related areas and various NLP tasks involving syntactic, semantic, and pragmatic analysis of language. It notes that NLP systems aim to allow computers to communicate with humans using everyday language and that ambiguity is ubiquitous in natural language, requiring disambiguation. Both manual and automatic learning approaches to developing NLP systems are examined.
This document provides an introduction and overview of natural language processing (NLP). It discusses how NLP aims to allow computers to communicate with humans using everyday language. It also discusses related areas like artificial intelligence, linguistics, and cognitive science. The document outlines some key aspects of communication like intention, generation, perception, analysis, and incorporation. It discusses the roles of syntax, semantics, and pragmatics. It also covers challenges in NLP like ambiguity and how ambiguity is pervasive and can lead to many possible interpretations. The document contrasts natural languages with computer languages and provides examples of common NLP tasks.
Body-Part Nouns and Whole-Part Relations in Portuguese (Jorge Baptista)
In this paper, we target the extraction of whole-part relations involving human entities and body-part nouns in SYSTEM, a hybrid statistical and rule-based Natural Language Processing chain for Portuguese. A whole-part relation is a semantic relation between an entity that is perceived as a constituent part of another entity, or a member of a set.
Poster presented at the 2nd meeting of the COST Action CA16105 - enetCollect : European Network for Combining Language Learning with Crowdsourcing Techniques, which took place at Alexandru Ioan Cuza University, in Iasi, Romania.
This poster shows paraphrastic suggestions in the eSPERTo paraphrasing system applied to a QA application on a virtual agent and to a summarization tool. It also shows how paraphrases can be used in language learning and the tests envisaged to make eSPERTo a Portuguese learning tool.
This document provides an overview of natural language processing and planning topics including:
- NLP tasks like parsing, machine translation, and information extraction.
- The components of a planning system including the planning agent, state and goal representations, and planning techniques like forward and backward chaining.
- Methods for natural language processing including pattern matching, syntactic analysis, and the stages of NLP like phonological, morphological, syntactic, semantic, and pragmatic analysis.
Natural language processing (NLP) involves developing systems that allow computers to understand and communicate using human language. NLP aims to understand syntax, semantics, and pragmatics. It addresses challenges like ambiguity, where a sentence can have multiple possible meanings. Syntactic parsing is the process of analyzing a sentence's structure using a context-free grammar to produce a parse tree. Top-down and bottom-up parsing are two approaches to syntactic parsing where top-down starts with the start symbol and bottom-up starts with the sentence's terminal symbols.
Natural language processing (NLP) is focused on developing systems that allow computers to communicate with humans using everyday language. NLP involves computational methods to aid understanding of human language. Communication for both speakers and hearers involves processes like intention, generation, perception, analysis, syntactic interpretation, semantic interpretation, and pragmatic interpretation. Natural language is highly ambiguous and must be disambiguated at syntax, semantics, and pragmatics levels. Ambiguities compound and generate many possible interpretations. Both top-down and bottom-up parsing are used to analyze syntax, but explore search spaces differently.
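The contrast between the two parsing strategies can be made concrete with a tiny top-down (recursive-descent) recognizer that expands from the start symbol down to the sentence's terminals. The grammar and sentences are invented examples, and this sketch does not handle left recursion, which a top-down parser cannot process without modification.

```python
# Tiny top-down recognizer for a toy context-free grammar.
# Nonterminals are GRAMMAR keys; anything else is a terminal word.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "Det": [["the"]], "N": [["dog"], ["cat"]], "V": [["saw"], ["sleeps"]],
}

def parse(symbol, tokens, pos):
    """Yield every input position reachable after deriving `symbol`
    starting at tokens[pos:], trying each right-hand side in turn."""
    for rhs in GRAMMAR[symbol]:
        ends = [pos]
        for sym in rhs:
            nxt = []
            for p in ends:
                if sym in GRAMMAR:                      # expand nonterminal
                    nxt.extend(parse(sym, tokens, p))
                elif p < len(tokens) and tokens[p] == sym:  # match terminal
                    nxt.append(p + 1)
            ends = nxt
        yield from ends

def recognize(sentence):
    tokens = sentence.split()
    # Accept iff some derivation of S consumes the whole input.
    return len(tokens) in parse("S", tokens, 0)

print(recognize("the dog saw the cat"))  # True
print(recognize("saw the dog"))          # False
```

A bottom-up parser would instead start from the terminal words and reduce them toward S, exploring the search space in the opposite direction.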
This document provides an overview of syntax and its analysis. It discusses:
- What syntax is and why it is studied
- Applications of syntactic analysis like search, paraphrasing and information extraction
- The structure of words (morphology) and sentences (syntax) and their interplay
- Different representations of syntactic structure like trees and dependencies
- Context-free grammars and their use in syntactic analysis
- Representing syntactic information and constraints through attributes and unification
- Phenomena like structural priming and characteristics of spoken language syntax
This paper presents the alignment of verbal predicate constructions with the clitic pronoun "lhe" in the European (EP) and Brazilian (BP) varieties of Portuguese, such as in the sentences "Já lhe arrumaram a bagagem" | "Sua bagagem está seguramente guardada" 'His baggage is safely stowed away', where the EP dative proclitic "lhe" contrasts with the BP possessive pronoun "sua". We selected several different paraphrastic contrasts, such as proclisis and enclisis, and clitic pronouns co-occurring with relative pronouns and negation-type adverbs, among other constructions, to illustrate the linguistic phenomenon. Some differences correspond to real contrasts between the two Portuguese varieties, while others purely represent stylistic choices. The contrasting variants were manually aligned to constitute a gold-standard dataset, and a typology has been established, to be further enlarged and made publicly available. The paraphrastic alignments were performed in the e-PACT corpus using the CLUE-Aligner tool. The research work was developed in the framework of the eSPERTo project.
Adnan: Introduction to Natural Language Processing (Mustafa Jarrar)
This document provides an introduction to natural language processing (NLP). It discusses key topics in NLP including languages and intelligence, the goals of NLP, applications of NLP, and general themes in NLP like ambiguity in language and statistical vs rule-based methods. The document also previews specific NLP techniques that will be covered like part-of-speech tagging, parsing, grammar induction, and finite state analysis. Empirical approaches to NLP are discussed including analyzing word frequencies in corpora and addressing data sparseness issues.
The document discusses challenges in cross-language word alignment. It outlines topics including word alignment concepts and applications, state of the art, and limitations due to phenomena like multiword units. Guidelines are presented for annotating alignments between English, French, Portuguese and Spanish, including challenges like prepositional dependencies, multiword units, and contractions. The goal is to create linguistically informed gold standard alignment sets to help machine translation tasks.
This paper presents a comparative study of alignment pairs, either contrasting expressions or stylistic variants of the same expression, in the European (EP) and Brazilian (BP) varieties of Portuguese. The alignments were collected semi-automatically using the CLUE-Aligner tool, which records all pairs of paraphrastic units resulting from the alignment task in a database. The corpus used was the children's literature book "Os Livros Que Devoraram o Meu Pai" (The Books that Devoured My Father) by the Portuguese author Afonso Cruz, together with the Brazilian adaptation of this book. The main goal of the work presented here is to gather equivalent phrasal expressions and different syntactic constructions that convey the same meaning in EP and BP, and to contribute to the optimisation of the editorial processes that the adaptation of texts requires, but which are suitable for any type of editorial process. This study provides a scientific basis for future work on editing, proofreading, and converting text to and from any variety of Portuguese from a computational point of view, namely for use in a paraphrasing system with a variety-adaptation functionality, even in the case of a literary text. We contemplate cases that are "challenging" from a literary point of view, looking for alternatives that do not tamper with the imagery richness of the original version.
Keynote @ SEMANTICS 2017 (Amsterdam, September 2017) on the convergences between NLP and KE in the era of the semantic web, with a focus on semantic relation extraction from text.
This document discusses the evolution of natural language processing (NLP) and knowledge engineering (KE) and their convergence, especially with the rise of deep learning and the semantic web. It outlines how NLP and KE have moved from early ambitions of full language understanding and problem solving to more practical, layered approaches focused on specific tasks. The semantic web provides standards and architectures that benefit both NLP and KE by enabling semantic annotation, linking of data, and use of knowledge sources. Deep learning allows NLP to learn representations from large corpora and benefit from semantic resources. Relation extraction and ontology learning from text are examples of the convergence. Challenges remain around contextual language, knowledge assertion, and industrial applications.
Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the world of natural language - unstructured data that by its very nature has important latent information for humans. NLP practitioners have benefitted from machine learning techniques to unlock meaning from large corpora, and in this class we’ll explore how to do that particularly with Python, the Natural Language Toolkit (NLTK), and to a lesser extent, the Gensim Library.
NLTK is an excellent library for machine learning-based NLP, written in Python by experts from both academia and industry. Python allows you to create rich data applications rapidly, iterating on hypotheses. Gensim provides vector-based topic modeling, which is currently absent in both NLTK and Scikit-Learn. The combination of Python + NLTK means that you can easily add language-aware data products to your larger analytical workflows and applications.
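As a minimal standard-library counterpart to the word-frequency analysis such a class would do with NLTK's FreqDist, the following snippet counts word occurrences over a tiny invented corpus:

```python
# Word-frequency analysis with only the standard library; NLTK's
# FreqDist offers the same idea with richer tokenization. Corpus invented.
import re
from collections import Counter

corpus = "The cat sat on the mat. The cat slept."
tokens = re.findall(r"[a-z]+", corpus.lower())  # crude tokenizer
freq = Counter(tokens)

print(freq.most_common(2))  # [('the', 3), ('cat', 2)]
```

From here one would typically move to NLTK for proper tokenization and tagging, and to Gensim for vector-based topic modeling, as the description above suggests.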
These slides present the new features of ANNIS 3.1.6 and 3.1.7 and give a basic introduction to what ANNIS is and how to use it. ANNIS is an open-source, cross-platform (Linux, Mac, Windows), web-browser-based search and visualization architecture for complex multilayer linguistic corpora with diverse types of annotation.
Learning Multilingual Semantic Parsers for Question Answering over Linked Data (shakimov)
The document summarizes a PhD dissertation defense talk on learning multilingual semantic parsers for question answering over linked data. It discusses comparing neural and probabilistic graphical model architectures for semantic parsing to map natural language to formal meaning representations. The talk outlines introducing dependency parse tree-based approaches, evaluating different model architectures, and addressing challenges in building multilingual question answering systems over structured knowledge bases.
This paper is the result of collaboration between two projects: Emocionário and eSPERTo.
Emocionário aims at organizing emotions in Portuguese and annotating them in corpora. eSPERTo is a paraphrasing system that uses the NooJ linguistic engine, grammars, and lexicons.
The aims of this collaboration were fivefold: (i) from Emocionário's point of view, it would be very useful to have an emotion paraphraser to help identify more cases of emotions in its corpora; (ii) from eSPERTo's point of view, adding emotion paraphrases would considerably enhance its paraphrasing power; (iii) applying the emotion classification to a hitherto unused application domain would be a good way to evaluate Emocionário's capabilities and shortcomings; (iv) both projects would gain from learning more about real paraphrases of emotion in text; and (v) an interesting question is to assess how good the methodology employed to harvest emotion paraphrases from parallel text is.
This paper reports our first attempt of integrating eSPERTo’s paraphrastic engine, which is based on NooJ platform, with two application scenarios: a conversational agent, and a summarization system. We briefly describe eSPERTo’s base resources, and the necessary modifications to these resources
that enabled the production of paraphrases required to feed both systems. Although the improvement observed in both scenarios is not significant, we present a detailed error analysis to further improve the achieved results in future experiments.
This paper presents a methodology to extract a paraphrase database for the European and Brazilian varieties of Portuguese, and discusses a set of paraphrastic categories of multiwords and
phrasal units, such as the compounds toda a gente vs todo o mundo "everybody" or the gerundive constructions [estar a + V-Inf] vs [ficar + V-Ger] (e.g., estive a observar vs fiquei observando "I was observing"), which are extremely relevant to high quality paraphrasing. The variants were manually aligned in the e-PACT corpus, using the CLUE-Aligner tool. The methodology, inspired
in the Logos Model, focuses on a semantico-syntactic analysis of each paraphrastic unit and constitutes a subset of the Gold-CLUE-Paraphrases.1 The construction of a larger dataset of
paraphrastic contrasts among the distinct varieties of the Portuguese language is indispensable for variety adaptation, i.e., for dealing with the cultural, linguistic and stylistic differences between them, making it possible to convert texts (semi-)automatically from one variety into another, a
key function in paraphrasing systems. This topic represents an interesting new line of research with valuable applications in language learning, language generation, question-answering, summarization, and machine translation, among others. The paraphrastic units are the first resource of its kind for Portuguese to become available to the scientific community for research purposes.
The document provides an introduction to natural language processing (NLP), discussing key related areas and various NLP tasks involving syntactic, semantic, and pragmatic analysis of language. It notes that NLP systems aim to allow computers to communicate with humans using everyday language and that ambiguity is ubiquitous in natural language, requiring disambiguation. Both manual and automatic learning approaches to developing NLP systems are examined.
This document provides an introduction and overview of natural language processing (NLP). It discusses how NLP aims to allow computers to communicate with humans using everyday language. It also discusses related areas like artificial intelligence, linguistics, and cognitive science. The document outlines some key aspects of communication like intention, generation, perception, analysis, and incorporation. It discusses the roles of syntax, semantics, and pragmatics. It also covers challenges in NLP like ambiguity and how ambiguity is pervasive and can lead to many possible interpretations. The document contrasts natural languages with computer languages and provides examples of common NLP tasks.
Body-Part Nouns and Whole-Part Relations in PortugueseJorge Baptista
In this paper, we target the extraction of whole-part rela- tions involving human entities and body-part nouns in SYSTEM, a hy- brid statistical and rule-based Natural Language Processing chain for Portuguese. Whole-part relation is a semantic relation between an entity that is perceived as a constituent part of another entity, or a member of a set.
Poster presented at the 2nd meeting of the COST Action CA16105 - enetCollect : European Network for Combining Language Learning with Crowdsourcing Techniques, which took place at Alexandru Ioan Cuza University, in Iasi, Romania.
This poster shows paraphrastic suggestions in the eSPERTo paraphrasing system applied to a QA application on a virtual agent and to a summarization tool. It also shows how paraphrases can be used in language learning and the tests envisaged to make eSPERTo a Portuguese learning tool.
This document provides an overview of natural language processing and planning topics including:
- NLP tasks like parsing, machine translation, and information extraction.
- The components of a planning system including the planning agent, state and goal representations, and planning techniques like forward and backward chaining.
- Methods for natural language processing including pattern matching, syntactic analysis, and the stages of NLP like phonological, morphological, syntactic, semantic, and pragmatic analysis.
Natural language processing (NLP) involves developing systems that allow computers to understand and communicate using human language. NLP aims to understand syntax, semantics, and pragmatics. It addresses challenges like ambiguity, where a sentence can have multiple possible meanings. Syntactic parsing is the process of analyzing a sentence's structure using a context-free grammar to produce a parse tree. Top-down and bottom-up parsing are two approaches to syntactic parsing where top-down starts with the start symbol and bottom-up starts with the sentence's terminal symbols.
Natural language processing (NLP) is focused on developing systems that allow computers to communicate with humans using everyday language. NLP involves computational methods to aid understanding of human language. Communication for both speakers and hearers involves processes like intention, generation, perception, analysis, syntactic interpretation, semantic interpretation, and pragmatic interpretation. Natural language is highly ambiguous and must be disambiguated at syntax, semantics, and pragmatics levels. Ambiguities compound and generate many possible interpretations. Both top-down and bottom-up parsing are used to analyze syntax, but explore search spaces differently.
This document provides an overview of syntax and its analysis. It discusses:
- What syntax is and why it is studied
- Applications of syntactic analysis like search, paraphrasing and information extraction
- The structure of words (morphology) and sentences (syntax) and their interplay
- Different representations of syntactic structure like trees and dependencies
- Context-free grammars and their use in syntactic analysis
- Representing syntactic information and constraints through attributes and unification
- Phenomena like structural priming and characteristics of spoken language syntax
This paper presents the alignment of verbal predicate constructions with the clitic pronoun "lhe" in the European (EP) and Brazilian (BP) varieties of Portuguese, such as in the sentences "Já lhe} arrumaram a bagagem" | "Sua bagagem está seguramente guardada" 'His baggage is safely stowed away', where the EP dative proclisis "lhe" contrasts with the BP possessive pronoun "sua". We have selected several different paraphrastic contrasts, such as proclisis and enclisis, clitic pronouns co-occurring with relative pronouns and negation-type adverbs, among other constructions to illustrate the linguistic phenomenon. Some differences correspond to real contrasts between the two Portuguese varieties, while others purely represent stylistic choices. The contrasting variants were manually aligned in order to constitute a gold standard dataset, and a typology has been established to be further enlarged and made publicly available. The paraphrastic alignments were performed in the e-PACT corpus using the CLUE-Aligner tool. The research work was developed in the framework of the eSPERTo project.
Adnan: Introduction to Natural Language Processing Mustafa Jarrar
This document provides an introduction to natural language processing (NLP). It discusses key topics in NLP including languages and intelligence, the goals of NLP, applications of NLP, and general themes in NLP like ambiguity in language and statistical vs rule-based methods. The document also previews specific NLP techniques that will be covered like part-of-speech tagging, parsing, grammar induction, and finite state analysis. Empirical approaches to NLP are discussed including analyzing word frequencies in corpora and addressing data sparseness issues.
The document discusses challenges in cross-language word alignment. It outlines topics including word alignment concepts and applications, state of the art, and limitations due to phenomena like multiword units. Guidelines are presented for annotating alignments between English, French, Portuguese and Spanish, including challenges like prepositional dependencies, multiword units, and contractions. The goal is to create linguistically informed gold standard alignment sets to help machine translation tasks.
This paper presents a comparative study of alignment pairs, either contrasting expressions or stylistic variants of the same expression, in the European (EP) and the Brazilian (BP) varieties of Portuguese. The alignments were collected semi-automatically using the CLUE-Aligner tool, which records all pairs of paraphrastic units resulting from the alignment task in a database. The corpus used was a children’s literature book, "Os Livros Que Devoraram o Meu Pai" (The Books that Devoured My Father) by the Portuguese author Afonso Cruz, and the Brazilian adaptation of this book. The main goal of the work presented here is to gather equivalent phrasal expressions and different syntactic constructions which convey the same meaning in EP and BP, and to contribute to the optimisation of the editorial processes required in the adaptation of texts, but which are suitable for any type of editorial process. This study provides a scientific basis for future work in the area of editing, proofreading and converting text to and from any variety of Portuguese from a computational point of view, namely for use in a paraphrasing system with a variety-adaptation functionality, even in the case of a literary text. We contemplate “challenging” cases, from a literary point of view, looking for alternatives that do not tamper with the imagery richness of the original version.
Keynote @ SEMANTICS 2017 (Amsterdam, Sept 2017) about convergences between NLP and KE in the era of the semantic web, with a focus on semantic relation extraction from text.
This document discusses the evolution of natural language processing (NLP) and knowledge engineering (KE) and their convergence, especially with the rise of deep learning and the semantic web. It outlines how NLP and KE have moved from early ambitions of full language understanding and problem solving to more practical, layered approaches focused on specific tasks. The semantic web provides standards and architectures that benefit both NLP and KE by enabling semantic annotation, linking of data, and use of knowledge sources. Deep learning allows NLP to learn representations from large corpora and benefit from semantic resources. Relation extraction and ontology learning from text are examples of the convergence. Challenges remain around contextual language, knowledge assertion, and industrial applications.
Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the world of natural language - unstructured data that by its very nature has important latent information for humans. NLP practitioners have benefitted from machine learning techniques to unlock meaning from large corpora, and in this class we’ll explore how to do that particularly with Python, the Natural Language Toolkit (NLTK), and to a lesser extent, the Gensim Library.
NLTK is an excellent library for machine learning-based NLP, written in Python by experts from both academia and industry. Python allows you to create rich data applications rapidly, iterating on hypotheses. Gensim provides vector-based topic modeling, which is currently absent in both NLTK and Scikit-Learn. The combination of Python + NLTK means that you can easily add language-aware data products to your larger analytical workflows and applications.
These slides present the new features of ANNIS 3.1.6 and 3.1.7, and give a basic introduction to what ANNIS is and how to use it. ANNIS is an open source, cross-platform (Linux, Mac, Windows), web browser-based search and visualization architecture for complex multilayer linguistic corpora with diverse types of annotation.
Learning Multilingual Semantic Parsers for Question Answering over Linked Data (shakimov)
The document summarizes a PhD dissertation defense talk on learning multilingual semantic parsers for question answering over linked data. It discusses comparing neural and probabilistic graphical model architectures for semantic parsing to map natural language to formal meaning representations. The talk outlines introducing dependency parse tree-based approaches, evaluating different model architectures, and addressing challenges in building multilingual question answering systems over structured knowledge bases.
Similar to Automatic Paraphrasing of Human Intransitive Adjectives in Portuguese (20)
This paper is the result of collaboration between two projects: Emocionário and eSPERTo.
Emocionário aims at organizing emotions in Portuguese and annotating them in corpora. eSPERTo is a paraphrasing system that uses the NooJ linguistic engine, grammars, and lexicons.
The aims for this collaboration were fivefold: (i) from Emocionário’s point of view, it would be very useful to have an emotion paraphraser to help identify more cases of emotions in our corpora; (ii) from eSPERTo’s point of view, adding emotion paraphrases would considerably enhance its paraphrasing power; (iii) applying the emotion classification to a hitherto unused application domain would be a good way to evaluate Emocionário’s capabilities and shortcomings; (iv) both projects would gain from learning more about real paraphrases of emotion in text; and finally, (v) an interesting question is to assess how good the methodology employed to harvest emotion paraphrases from parallel text is.
This study proposes a comparative analysis, linguistic but also literary and cultural, of the Portuguese and Brazilian editions of a work of children's and young-adult literature, Os Livros que Devoraram o Meu Pai, by the Portuguese author Afonso Cruz, which appears on the suggested reading lists of the curricula of both Portugal and Brazil. The specific goal is to present and discuss a selection of lexical units, idioms, and phrasal structures with adjectival function that alternate between the two varieties, i.e., between the author's choices in the EP variety and the corresponding solutions adopted in the BP version. The chosen methodology centres on contrastive linguistic analysis carried out with the aid of digital tools based on the eSPERTo project, using semi-automatic alignments produced with the CLUE-Aligner tool (REF). The corpus consists of the Portuguese and Brazilian editions of the work under study. The general goal of this work is to optimise the editorial processes necessarily involved in the adaptation of texts, and to survey the main difficulties of that process. This entails, among other things, an awareness of the limits imposed by a literary text, such as the fine line between indispensable adaptation and excessive intervention. Building on the results obtained, we also intend to encourage research into linguistic resources for the purposes of editing, proofreading, and teaching Portuguese as a first and/or foreign language, among other applications.
This document provides an introduction and welcome message from the local organizers of the 3rd annual enetCollect MC meeting being held in Lisbon, Portugal. The summary includes:
1) The organizers thank the speakers, chairs, members, volunteers, and sponsors for their contributions to the meeting.
2) They introduce the official host, Professor Isabel Trancoso, and provide details on her extensive experience and leadership roles in spoken language processing.
3) The organizers conclude by thanking everyone for their participation in the meeting in Lisbon.
This document discusses using syntactic-semantic analysis for information extraction in biomedicine. It aims to extract biomolecular events like phosphorylation from text. It uses dictionaries of entities and verbs associated with event types, and NooJ grammars to identify events. Evaluation on a shared task dataset shows average recall of 36.76% and precision of 65.58% for six event types. While results are promising, it discusses limitations like manual pattern identification and challenges with more complex event constructions.
This presentation addresses the problem of translating SVCs (support verb constructions), such as fazer uma operação (to make an operation). In particular, it focuses on the MT of biomedical-related SVCs. It argues that paraphrasing can help translate these MWEs with higher quality. This work is based on my PhD research, which addressed the problem of paraphrasing and translating SVCs in general.
ReWriter uses linguistically based automated paraphrasing and text-editing mechanisms to help users with their writing needs by providing suggestions for customized text authoring. It also generates word and phrasal usage data to help guide decision-making. ReWriter can be used in word processing applications or linguistic quality control for both source and target texts and it is a useful pre-editor for machine translation. The linguistic resources behind ReWriter, the paraphrasing grammars, and the tools from which ReWriter was derived will also be described, in this particular case, we illustrate ReWriter as a tool to process legal language.
Poster presented at the 2nd meeting of the COST Action CA16105 - enetCollect : European Network for Combining Language Learning with Crowdsourcing Techniques, which took place at Alexandru Ioan Cuza University, in Iasi, Romania.
The poster shows how chatbots can play an important role in Language Learning applications.
This paper presents the automation process of paraphrasing and converting Portuguese constructions typical of informal or spoken language into a formal written language. We illustrate this automation process with examples extracted from the e-PACT corpus that involve the placement of clitic pronouns in verbal compound contexts. Our task consists in paraphrasing and normalizing, among others, constructions such as "vou-lhe/posso-lhe fazer uma surpresa" into "vou/posso fazer-lhe uma surpresa" `lit: I will/can\_to him/her make a surprise / I will/can make\_to him/her a surprise; I will/can make him/her a surprise', where the clitic pronoun "lhe" migrates from an enclitic position after the first verb of the verbal compound to an enclitic position after the main verb, which is the verb responsible for the selection of that pronominal argument. The first verb is either an auxiliary verb or a volitive verb, e.g. "querer" `want'. This is a standard revision procedure in EP. Cases like this represent linguistic phenomena where language students and language users in general get confused or stumble. The paper focuses on general language where the phenomena being observed occur, describes examples of interest found in the corpus, and presents an automatic solution for the normalization of informal syntactic inadequacies found in the researched structures into standard formal writing structures through the application of very generic transformational grammars.
This paper performs a detailed analysis of the alignment of Portuguese contractions, based on a previously aligned bilingual corpus. The alignment task was performed manually in a subset of the English-Portuguese CLUE4Translation Alignment Collection. The initial parallel corpus was pre-processed and a decision was made as to whether the contraction should be maintained or decomposed in the alignment. Decomposition was required in the cases in which the two words that have been concatenated, i.e., the preposition and the determiner or pronoun, go in two separate translation alignment pairs (PT- [no seio de] [a União Europeia] EN- [within] [the European Union]). Most contractions required decomposition in contexts where they are positioned at the end of a multiword unit. On the other hand, contractions tend to be maintained when they occur at the beginning or in the middle of the multiword unit, i.e., in the frozen part of the multiword (PT- [no que diz respeito a] EN- [with regard to] or PT- [além disso] EN- [in addition]). A correct alignment of multiwords and phrasal units containing contractions is instrumental for machine translation, paraphrasing, and variety adaptation.
This document describes the eSPERTo system, which generates paraphrases for text editing and revision. The main goal of the project is to develop a system capable of identifying and generating paraphrases to improve comprehension, simplify language, and support the learning of the Portuguese language. The system can be useful in several settings, such as education, journalism, and translation.
ReEscreve (in English, ReWriter) is a multi-purpose paraphraser that uses grammar-based paraphrasing capabilities suitable for source and target control (pre- and post-editing) and is useful for human and machine translation.
Spoken Language Systems Lab @ INESC-ID poster presented at the 1st meeting of the COST Action CA16105 - enetCollect : European Network for Combining Language Learning with Crowdsourcing Techniques, which took place at Eurac Research in Bolzano, Italy.
This presentation describes the integration of lexicon-grammar of predicate nouns with the support verb "fazer" ("to do" or "to make") into Port4NooJ, the Portuguese language module for NooJ. Port4NooJ resources are used by eSPERTo system to generate paraphrases, i.e., alternative ways to say or write the same sentence.
This presentation addresses the impact of multiword translation errors in machine translation (MT). We have analysed translations of multiwords in the OpenLogos rule-based system (RBMT) and in the Google Translate statistical system (SMT) for the English-French, English-Italian, and English-Portuguese language pairs. Our study shows that, for distinct reasons, multiwords remain a problematic area for MT independently of the approach, and require adequate linguistic quality evaluation metrics founded on a systematic categorization of errors by MT expert linguists. We propose an empirically-driven taxonomy for multiwords, and highlight the need for the development of specific corpora for multiword evaluation. Finally, the paper presents the Logos approach to multiword processing, illustrating how semantico-syntactic rules contribute to multiword translation quality.
The document discusses content writing optimization using a rewriter tool. It outlines the rewriter's capabilities such as providing paraphrases through synonyms, support verb transformations, and other linguistic rules. The rewriter has been developed to aid writing, translation, and machine translation. Evaluation results show the rewriter improves translation quality by paraphrasing complex linguistic constructs like support verb constructions. The document also discusses the rewriter's interface and linguistic resources and rules, and how the tool could benefit writers, translators and language learners.
The document discusses human translation versus machine translation. It notes that while human translation requires skills like language proficiency and cultural knowledge, machine translation relies on linguistics and computer science. The document also outlines some challenges of machine translation, such as ambiguity and complex grammar, and presents examples of how machine translation systems struggle with issues like lexical ambiguity. Resources and tools developed for machine translation are also summarized, including lexical databases and paraphrasing tools.
In this session I will talk about linguistic quality control and paraphrases, and their role as aids to machine translation. In addition, I will present a tool that uses paraphrastic methods for translation.
Escola de Verão Belinda Maia (Edv 2009)
(FLUP, Porto, Portugal, 29 June – 3 July 2009)
More from INESC-ID (Spoken Language Systems Laboratory - L2F) (20)
Automatic Paraphrasing of Human Intransitive Adjectives in Portuguese
1. Paraphrasing Human Intransitive Adjective Constructions in Port4NooJ
Cristina Mota¹, Paula Carvalho¹,², Francisco Raposo¹, Anabela Barreiro¹
¹ INESC-ID, Lisbon
² Universidade Europeia | Laureate International Universities
International NooJ 2015 Conference · Minsk, 13 June
2. Introduction to the eSPERTo Project
eSPERTo – System for Paraphrasing in Editing and Revision of Texts
• Main objective
  – Design and development of a linguistically enhanced paraphrase generator
    • Semantico-syntactic and multiword units
    • Sensitive to context
• Method
  – Hybrid system, combining statistics and linguistic knowledge to identify and generate new and more complex paraphrases
  – Exploitation of existing paraphrasing resources
• Web platform
  – Interactive application to help Portuguese language learners in producing and revising their texts
  – Text-editing mechanisms which provide a variety of alternatives for each expression
  – Users can choose or suggest expressions that can be immediately applied to their text
  – Support to writing optimization, understandability and translatability
3. Linguistic Resources
• Linguistic knowledge databases: Port4NooJ, Eng4NooJ
• Originally (English-Portuguese) OpenLogos resources (http://logos-os.dfki.de/)
• Converted into NooJ format
• Enhanced with new properties, including derivational, morpho-syntactic and semantic relations
4. Earlier versions
• Phrasal verbs into equivalent expressions
  – to clear up (weather) = (weather) to become better/brighter
• Support verb constructions into single verbs
  – to make a decision = to decide
  – to make a presentation of = present
  – to give support to N(AN) = to support N(AN)
  – to get into contact with = to contact
  – to become acid = to acidify
• Support verb constructions into their stylistic variants
  – to make an audit = to perform an audit
  – to make an impression = to create an impression
• Aspectual constructions into single verbs
  – to launch an attack = to attack
5. Earlier versions
• Adverbs (compounds into single adverbs)
  – in a constructive way = constructively
  – on purpose = purposely = deliberately
• Relatives into participial adjectives
  – the president that was elected = the president elect
• Relatives into possessives
  – the role that Europe plays/has = the role of Europe
  – the position that the Church defends = the position of the Church
• Relatives into compound nouns (and vice-versa)
  – a container for the milk = a milk container
  – a bottle made of plastic = a plastic bottle
• Agentive passives into actives
  – the young man is released by the police officer = the police officer releases the young man
6. eSPERTo Architecture
[Architecture diagram: input text plus resource selection is processed by noojapply + STRING, which draws on linguistic resources (Port4NooJ, Eng4Nooj, Ital4NooJ, Fren4NooJ, Ger4NooJ, Spa4NooJ dictionaries and grammars) to produce paraphrase suggestions; eSPERTo online combines the text with the suggestions, and user feedback feeds a hybrid paraphrase acquisition module whose dictionaries and grammars undergo linguist validation.]
9. eSPERTo: noojapply Integration
noojapply pt result.ind lr.no(d|m)* sr.nog* REESCREVE.nog text.txt
eSPERTo Web Interface: user configuration
11. eSPERTo: noojapply Integration
noojapply pt result.ind lr.no(d|m)* sr.nog* REESCREVE.nog text.txt
eSPERTo Web Interface: result presentation
teste.txt:0,17,O homem que é americano
teste.txt:0,17,O homem da América
teste.txt:0,17,O homem de nacionalidade americana
teste.txt:0,17,O homem de naturalidade americana
teste.txt:0,17,O homem de origem americana
teste.txt:0,39,o trabalho foi apresentado pelo homem americano
teste.txt:18,10,efectuar apresentação
teste.txt:18,10,fazer apresentação
teste.txt:18,10,realizar apresentação
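Each line of the noojapply result above has the shape `file:offset,length,paraphrase`, so several suggestions for the same source span share an offset and length. A minimal Python sketch of parsing those triples and grouping the suggestions per span (an assumed post-processing step, not part of NooJ itself):

```python
def parse_noojapply_output(lines):
    """Parse noojapply result lines of the form 'file:offset,length,text'
    into paraphrase suggestions grouped by (offset, length) source span."""
    suggestions = {}
    for line in lines:
        # Split off the file name, then offset, length and the paraphrase text
        _, _, payload = line.partition(":")
        offset, length, text = payload.split(",", 2)
        span = (int(offset), int(length))
        suggestions.setdefault(span, []).append(text)
    return suggestions

output = [
    "teste.txt:0,17,O homem que é americano",
    "teste.txt:0,17,O homem da América",
    "teste.txt:18,10,fazer apresentação",
]
print(parse_noojapply_output(output)[(0, 17)])
# → ['O homem que é americano', 'O homem da América']
```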
12. eSPERTo: noojapply Integration
the man who is American
the man from America
the man with American nationality
…
The American man
https://esperto.l2f.inesc-id.pt/esperto/esperto/demo.pl
13. LG of Portuguese Human Intransitive Adjectives
• eSPERTo was enhanced with new paraphrases, derived from 15 Lexicon-Grammar (LG) tables describing the distributional properties of 4,250 human intransitive adjectives (HIA) (Carvalho, 2008):
  – Syntactic and semantic nature of the subject modified by each adjective;
  – Copulative verbs selected by each adjective;
  – Constraints on the quantification of adjectives by an adverb or a degree morpheme;
  – Position of adjectives in adnominal context;
  – Optional adjective "complements";
  – Generic NP and cross-constructions, where the adjective fills the head of a noun phrase;
  – Characterizing indefinite constructions, where the adjective occurs after an indefinite article;
  – Exclamative sentences expressing insult.
14. Adjective Selection
• CETEMPublico Adj: 17,300 lemmas
  ↓ lookup with the LabEL lexical resources (LABEL-LEX) (Ranchhod et al., 2004)
• Predicate Adj: 13,875 lemmas
  ↓ pre-selection and classification of Adj according to the linguistic criteria defined in Carvalho (2001): Nhum Vcop Adj
• Adj Intrans Hum: 4,250 lemmas
  – Adj Doen: 187 | Adj Filo: 303 | Adj Nac: 651 | Adj Hum: 3,109
15. Hum Adj Subclassification Criteria
[Classification tree: ADJ HUM is split by the copulative verb selected (SER, SER + ESTAR, ESTAR) and by the base constructions accepted (N0 ser Adj, N0 ser um Adj, N0 estar Adj); each branch is further subdivided by the nature of the subject (N0 =: Nhum, Nap de Nhum, QueF).]
16. Hum Adj Subclassification Criteria
[The same classification tree, with the resulting classes and an example adjective for each:]
SER (N0 ser Adj / N0 ser um Adj): SAHP1 inteligente, SAHP2 atlético, SAHP3 culto, SAHC1 idiota, SAHC2 sedutor, SAHC3 inculto
ESTAR (N0 estar Adj): EAHP2 abatido, EAHP3 zangado
SER + ESTAR (N0 (ser+estar) Adj): SEAHP2 bonito, SEAHP3 velho, SEAHC2 gordo, SEAHC3 bêbado
18. New Transformations
• Adjective, noun and verb morphologically related constructions
  – está zangado (is angry) = zangou-se (got angry) = esteve envolvido numa zanga (was involved in a quarrel)
• Adjective constructions supported by different copulative verbs
  – estar perdido (to be lost) = andar perdido (to go around lost)
• Constructions involving nationality and other membership relations
  – de origem portuguesa (of Portuguese origin/roots) = portugueses (Portuguese) = de Portugal (from Portugal)
  – benfiquista (Benfica fan) = do Sport Lisboa e Benfica (of Sport Lisboa e Benfica)
• Cross-constructions
  – o idiota do rapaz (the idiot of the boy) = o rapaz é um idiota (the boy is an idiot)
• Appropriate noun constructions
  – foi moderado nos seus comentários (he was moderate in his comments) = os seus comentários foram moderados (his comments were moderate) = foi moderado (he was moderate)
• Generic noun phrases
  – é um indivíduo estúpido (he is a stupid individual) = é um estúpido (he is a fool) = é estúpido (he is stupid)
19. Integration of LG of Portuguese Human Intransitive Adjectives
– From LG tables to NooJ dictionaries
• Mostly done automatically with different scripts: LG tables (Adjectivos_IH) → Port4NooJ
✓ If the adjective is in Port4NooJ, merge the LG properties with the dictionary entry; else create a new entry
✓ Create FLX and DRV codes and corresponding rules as needed
✓ Check for missing FLX and DRV codes
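The merge-or-create step can be sketched as follows. This is a minimal illustration only: modelling entries as property lists keyed by (lemma, POS) is an assumption for the sketch, not the actual Port4NooJ file format.

```python
def merge_lg_properties(dictionary, lemma, lg_props):
    """Merge LG table properties into an existing adjective entry,
    or create a new entry when the lemma is unknown."""
    key = (lemma, "A")  # adjective entries only
    if key in dictionary:
        # Keep existing properties; append new LG ones without duplicates
        existing = dictionary[key]
        existing.extend(p for p in lg_props if p not in existing)
    else:
        dictionary[key] = list(lg_props)
    return dictionary[key]

d = {("alto", "A"): ["FLX=ALTO"]}
merge_lg_properties(d, "alto", ["IH", "FLX=ALTO"])
print(d[("alto", "A")])      # → ['FLX=ALTO', 'IH']
merge_lg_properties(d, "abissínio", ["FLX=ALTO", "IH", "Table=SAN"])
print(d[("abissínio", "A")]) # → ['FLX=ALTO', 'IH', 'Table=SAN']
```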
20. Integration of LG of PT HIA
– From LG tables to NooJ dictionaries
• Representation of LG table properties
+Top=Abissínia +TopDET=a +NclassPnacionalidade +NAdj +Vcopser +IH +Table=SAN
21. Integration of LG of PT HIA
– From LG tables to NooJ dictionaries
• Representation of LG table properties
+Top=Abissínia +TopDET=a +NclassPnacionalidade +NAdj +Vcopser +IH +Table=SAN
• Determined automatically by consulting the AC/DC corpora:
o homem abissínio ↔ o homem da Abissínia
o homem açoriano ↔ o homem dos Açores
o homem português ↔ o homem de Portugal
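The Adj ↔ de + toponym rewriting shown above can be illustrated with a small sketch (not the actual NooJ grammar): the properties +Top= and +TopDET= supply the place name and its article, and de contracts with the article following standard Portuguese morphology.

```python
# Contraction of the preposition "de" with the definite article
CONTRACTIONS = {"o": "do", "a": "da", "os": "dos", "as": "das"}

def toponym_paraphrase(np_head, top, top_det=None):
    """Rewrite e.g. 'o homem' (+Top=Abissínia, +TopDET=a)
    as 'o homem da Abissínia'."""
    prep = CONTRACTIONS.get(top_det, "de")  # bare "de" when no article
    return f"{np_head} {prep} {top}"

print(toponym_paraphrase("o homem", "Abissínia", "a"))   # → o homem da Abissínia
print(toponym_paraphrase("o homem", "Açores", "os"))     # → o homem dos Açores
print(toponym_paraphrase("o homem", "Portugal"))         # → o homem de Portugal
```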
22. Integration of LG of PT HIA
– From LG tables to NooJ dictionaries
• Representation of LG table properties
+IH +Table=SEAHP3 +Nome=alegria +Verbo=alegrar-se +Nnhum
23. Integration of LG of PT HIA
– From LG tables to NooJ dictionaries
• Representation of LG table properties
+IH +Table=SEAHP3 +DRV=A2N143:CASA +DRV=A2V6:FALAR +Reflexivo +Nnhum
24. Integration of LG of PT HIA
– From LG tables to NooJ dictionaries
• Representation of LG table properties
+IH +Table=SEAHP3 +DRV=A2N143:CASA +DRV=A2V6:FALAR +Reflexivo +Nnhum
• The DRV code is determined and formalized automatically by finding the radical shared by the adjective and the noun or verb:
  alegr(ia) = A2N143 = B1ia/N
  alegr(ar) = A2V6 = B1ar/V
• The FLX code is determined by consulting Port4NooJ:
  alegria,N+FLX=CASA+AB+state+EN=joy+SYNN=contentamento
  alegrar,V+FLX=FALAR+Aux=1+PRECVagree-type+Subset=…
• If the derived form does not exist, its code is assigned automatically.
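The radical-finding step above can be sketched in a few lines: strip the part of the adjective not shared with the derived word and record the derived word's ending, yielding the B&lt;n&gt;&lt;suffix&gt;/&lt;POS&gt; operator (an assumed implementation of the script, matching the alegre → alegr(ia) = B1ia/N example).

```python
import os

def drv_operator(adjective, derived, pos):
    """Build a NooJ-style derivation operator: delete the n final
    characters of the adjective not shared with the derived form,
    then append the derived form's ending."""
    radical = os.path.commonprefix([adjective, derived])
    n_delete = len(adjective) - len(radical)  # characters to strip
    suffix = derived[len(radical):]           # ending to append
    return f"B{n_delete}{suffix}/{pos}"

print(drv_operator("alegre", "alegria", "N"))  # → B1ia/N
print(drv_operator("alegre", "alegrar", "V"))  # → B1ar/V
```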
28. Integration of LG of PT HIA
– From LG tables to NooJ dictionaries
• Integration with eSPERTo dictionary entries
② Adjective not in Port4NooJ (or in it, but derived from another entry):
✓ FLX code is assigned automatically given the ending of the word
✓ Entries are checked for missing FLX codes and reviewed by a linguist
✓ All other properties come from the LG table
abissínio,A+FLX=ALTO+IH+Table=SAN+Nhum+Vcopser+Vcoptornarse+UMNclas
+UmModif+NclassPserde+NclassPorigem+NclassPnacionalidade
+NclassPnaturalidade+NAdj+Top=Abissínia+TopDET=a
(no entry in Port4NooJ)
arranhado,A+FLX=ALTO+IH+Table=EAHP2+Nhum+NapdeNhum+Npc+Vcopestar
+AdvQuant+Superlativo+NAdj+NhumVopAPrepNap+deemEDefNap
+DRV=A2N4:BALÃO+DRV=A2V2:FALAR+Reflexivo
(in Port4NooJ: arranhar,V+FLX=FALAR...)
solteiro,A+FLX=ALTO+IH+Table=SEAHP3+Nhum+Vcopser+Vcopestar+Vcopficar
+Vcoppermanecer+Vcopencontrarse+UMNclas+UmModif+Superlativo+NAdj
(in Port4NooJ: solteiro,N+FLX=ANO+AN+des+EN=bachelor)
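Assigning the FLX code from the word ending, as described above, can be sketched with a lookup table. Only the ALTO paradigm for -o adjectives is attested in the entries shown; the other ending-to-paradigm pairs below are hypothetical placeholders, not the real Port4NooJ rule set.

```python
# Hypothetical fragment of an ending→inflection-paradigm map;
# ALTO (-o/-a/-os/-as, e.g. abissínio) is from the slides, the rest assumed.
ENDING_TO_FLX = {
    "o": "ALTO",
    "e": "VERDE",   # assumed code for gender-invariable -e adjectives
    "l": "AZUL",    # assumed code for -l adjectives
}

def assign_flx(adjective, default="ALTO"):
    """Pick an inflection paradigm from the adjective's final letter."""
    return ENDING_TO_FLX.get(adjective[-1], default)

print(assign_flx("abissínio"))  # → ALTO
print(assign_flx("arranhado"))  # → ALTO
```

Entries produced this way are then checked for missing codes and reviewed by a linguist, as the slide notes.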
29. Integration of LG of PT HIA
– From LG to NooJ grammars
• Option 1: Syntactic Parsing
30. Integration of LG of PT HIA
– From LG to NooJ grammars
• Option 1: Syntactic Parsing
Input1:  o homem é tonto
Output1: <REESCREVE+TEXTO=é um tonto> é tonto </REESCREVE>
31. Integration of LG of PT HIA
– From LG to NooJ grammars
• Option 1: Syntactic Parsing
Input2:  o homem é um tonto
Output2: <REESCREVE+TEXTO=é tonto> é um tonto </REESCREVE>
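The REESCREVE annotation marks the span to rewrite and carries the replacement in its TEXTO attribute, so applying a suggestion reduces to a span substitution. A minimal sketch of that application step (assumed post-processing outside NooJ):

```python
def apply_reescreve(sentence, span_text, replacement):
    """Replace the annotated span with the paraphrase given in TEXTO
    (first occurrence only, matching the annotated position)."""
    return sentence.replace(span_text, replacement, 1)

print(apply_reescreve("o homem é tonto", "é tonto", "é um tonto"))
# → o homem é um tonto
print(apply_reescreve("o homem é um tonto", "é um tonto", "é tonto"))
# → o homem é tonto
```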
32. Integration of LG of PT HIA
– From LG to NooJ grammars
• Option 2: Transformational module
33. Integration of LG of PT HIA
– From LG to NooJ grammars
• Option 2: Transformational module
Input1 ↔ Input2
é tonto, REESCREVE+Cpred
é um tonto, REESCREVE+CCI
34. Preliminary Results
• 5,150 human intransitive adjectives
• 677 new derivational paradigms
• Example grammars for the syntactic parser and the transformational module
• 50% increase in Port4NooJ adjective entries
35. Preliminary Results
Table    Example      In Port4NooJ   New    % In
SAHP1    inteligente  303            247    55%
SAHP2    atlético     142            226    39%
SEAHP2   bonito       53             87     38%
SAHC1    idiota       115            229    33%
SAHP3    culto        97             263    27%
SEAHP3   velho        32             93     26%
SEAHC2   gordo        14             41     25%
SAF      anarquista   70             234    23%
SEAHC3   bêbado       15             53     22%
SEAD     leproso      39             149    21%
EAHP3    zangado      54             213    20%
SAHC2    sedutor      41             177    19%
EAHP2    abatido      18             87     17%
SAN      americano    108            544    17%
SAHC3    inculto      54             465    10%
Total                 1155           3108   26%
36. Next Steps
• Complete the integration of the LG of human intransitive adjectives
  – By creating all the grammars needed to process the constructions formalized in the LG
• Revise and evaluate the new resources
• Integrate and adapt additional LG grammars:
  – Constructions with Vsup ser de (Baptista, 2005)
  – Constructions with Vsup fazer (Chacoto, 2005)
• Use the grammar paraphrase knowledge to create a corpus of paraphrases and develop eSPERTo's hybrid paraphrase acquisition engine
  – Train a machine-learning paraphrase acquisition system
  – Annotate semantico-syntactic and multiword paraphrases in corpora for use in training and evaluation
  – Merge in paraphrases collected statistically