Unsupervised Extraction of False Friends from Parallel Bi-Texts Using the Web...Svetlin Nakov
Scientific paper: False friends are pairs of words in two languages that are perceived as similar but have different meanings, e.g., Gift in German means poison in English. In this paper, we present several unsupervised algorithms for acquiring such pairs from a sentence-aligned bi-text. First, we try different ways of exploiting simple statistics about monolingual word occurrences and cross-lingual word co-occurrences in the bi-text. Second, using methods from statistical machine translation, we induce word alignments in an unsupervised way, from which we estimate lexical translation probabilities, which we use to measure cross-lingual semantic similarity. Third, we experiment with a semantic similarity measure that uses the Web as a corpus to extract local contexts from text snippets returned by a search engine, and a bilingual glossary of known word translation pairs used as “bridges”. Finally, all measures are combined and applied to the task of identifying likely false friends. The evaluation for Russian and Bulgarian shows a significant improvement over previously proposed algorithms.
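The first step described above, exploiting simple co-occurrence statistics over a sentence-aligned bi-text, can be sketched as a Dice-style association score. This is a minimal illustration only; the function name and the toy data are invented, not taken from the paper:

```python
from collections import Counter

def dice_cooccurrence(bitext):
    """Score cross-lingual word pairs by how often they appear
    in aligned sentence pairs (Dice coefficient)."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_sent, tgt_sent in bitext:
        src_words, tgt_words = set(src_sent), set(tgt_sent)
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        for w in src_words:
            for v in tgt_words:
                pair_count[(w, v)] += 1
    return {(w, v): 2 * c / (src_count[w] + tgt_count[v])
            for (w, v), c in pair_count.items()}

# Two toy aligned sentence pairs (word lists), purely illustrative.
bitext = [(["gift", "given"], ["otrova", "data"]),
          (["gift", "box"], ["otrova", "kutija"])]
scores = dice_cooccurrence(bitext)
print(scores[("gift", "otrova")])  # 1.0: the pair always co-occurs
```

A high score for a spelling-similar pair suggests cognates; a low score despite similar spelling flags a likely false friend.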
Unifying Logical Form and The Linguistic Level of LF - Hady Ba
This document summarizes a talk given by Mouhamadou El Hady Ba on unifying logical form between linguistics and logic. The talk discusses (1) the generative linguistic framework and the linguistic level of logical form (LF), (2) logicians' view of logical form as translating natural language into an artificial logical language, and (3) how generalized quantification allows for a unification of the two views of logical form by treating noun phrases as generalized quantifiers. The unification shows that linguists' LF and logicians' logical form are equivalent and reveals the true logical structure of natural language sentences. However, problems remain in fully identifying the two views of logical form and determining whether grammatical rules are truly
MORPHOLOGICAL SEGMENTATION WITH LSTM NEURAL NETWORKS FOR TIGRINYA - ijnlc
Morphological segmentation is a fundamental task in language processing. Some languages, such as Arabic and Tigrinya, have words packed with very rich morphological information. Therefore, unpacking this information becomes a necessary task for many downstream natural language processing tasks. This paper presents the first morphological segmentation research for Tigrinya. We constructed a new morphologically segmented corpus with 45,127 manually segmented tokens. Conditional random fields (CRF) and window-based long short-term memory (LSTM) neural networks were employed separately to develop our boundary detection models. We applied language-independent character and substring features for the CRF and character embeddings for the LSTM networks. Experiments were performed with four variants of the Begin-Inside-Outside (BIO) chunk annotation scheme. We achieved a 94.67% F1 score using bidirectional LSTMs with a fixed-size window approach to morpheme boundary detection.
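The BIO chunk annotation mentioned above can be illustrated by converting a segmented word into character-level boundary labels. This is a toy sketch using an English word, since no Tigrinya data is reproduced here:

```python
def bio_labels(morphemes):
    """Character-level Begin/Inside labels for a segmented word:
    'B' marks the first character of each morpheme, 'I' the rest."""
    labels = []
    for m in morphemes:
        labels.append("B")
        labels.extend("I" * (len(m) - 1))
    return labels

# Hypothetical segmentation of an English word, for illustration only.
print(bio_labels(["un", "pack", "ing"]))  # B marks each morpheme start
```

A boundary detector trained on such labels recovers the segmentation by splitting before every predicted "B".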
GSCL2013.A Study of Chinese Word Segmentation Based on the Characteristics of...Lifeng (Aaron) Han
Language Processing and Knowledge in the Web - Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology (GSCL 2013), Darmstadt, Germany, September 25–27, 2013. LNCS Vol. 8105. Volume editors: Iryna Gurevych, Chris Biemann, and Torsten Zesch. (EI)
The document describes the Leipzig Glossing Rules, which provide conventions for interlinear morpheme-by-morpheme glosses. The rules were developed jointly by linguists at the Max Planck Institute and University of Leipzig. The rules cover aligning glosses with text, separating morphemes, using abbreviations for grammatical categories, and conventions for one-to-many correspondences between language and gloss. The goal is to make common glossing practices explicit and provide a standard set of conventions for linguists. Feedback is welcome to improve the rules over time.
Nakov S., Nakov P., Paskaleva E., Improved Word Alignments Using the Web as a...Svetlin Nakov
Nakov P., Nakov S., Paskaleva E., Improved Word Alignments Using the Web as a Corpus, Proceedings of the International Conference RANLP 2007 (Recent Advances in Natural Language Processing), pp. 400-405, ISBN 978-954-91743-7-3, Borovets, Bulgaria, 27-29 September 2007
Automatic Identification of False Friends in Parallel Corpora: Statistical an...Svetlin Nakov
Scientific article: False friends are pairs of words in two languages that are perceived as similar but have different meanings. We present an improved algorithm for acquiring false friends from a sentence-aligned parallel corpus, based on statistical observations of word occurrences and co-occurrences in the parallel sentences. The results are compared with an entirely semantic measure of cross-lingual similarity between words that uses the Web as a corpus, analyzing the words' local contexts extracted from the text snippets returned by searching in Google. The statistical and semantic measures are further combined into an improved algorithm for the identification of false friends that achieves nearly twice the performance of previously known algorithms. The evaluation is performed on identifying cognates between Bulgarian and Russian, but the proposed methods can be adapted to other language pairs for which parallel corpora and bilingual glossaries are available.
This document discusses Lexical Functional Grammar (LFG) and Generalized Phrase Structure Grammar (GPSG). LFG was developed in the 1970s and emphasizes analyzing phenomena in lexical and functional terms. It uses two levels of structure: c-structure, which is a tree structure, and f-structure, which captures grammatical functions. GPSG was developed in 1985 and is confined to context-free phrase structure rules. It uses immediate dominance and linear precedence rules.
Phrase structure grammar models the internal structure of sentences in a hierarchical organization. It represents sentences as consisting of phrases, which are made up of words, which are made up of morphemes and phonemes. Phrase structure grammars use rewrite rules to break down syntactic structures into their constituent parts in a step-by-step manner. Deep structure represents the underlying meaning of a sentence, while surface structure is the actual form used. Transformational rules derive surface structure from deep structure.
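The step-by-step rewrite process described above can be sketched with a toy grammar. The rules and the deterministic expansion strategy are illustrative only, not taken from any particular analysis:

```python
# Toy rewrite rules in the phrase-structure style described above.
RULES = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"]],
    "Det": [["the"]],
    "N":   [["dog"]],
    "V":   [["sees"]],
}

def derive(symbols):
    """Rewrite non-terminals step by step until only words remain."""
    out = []
    for sym in symbols:
        if sym in RULES:
            out.extend(derive(RULES[sym][0]))  # always take the first option
        else:
            out.append(sym)
    return out

print(" ".join(derive(["S"])))  # the dog sees the dog
```

Each recursive call corresponds to one application of a rewrite rule, mirroring the derivation of surface structure from the start symbol.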
Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge...Guy De Pauw
This document discusses tokenization and computational verb morphology for Setswana, a Bantu language with a disjunctive orthography. It presents an approach that combines two tokenization transducers and a morphological analyzer to effectively tokenize Setswana text. The approach was tested on a short Setswana text and achieved 93.6% accuracy between the automatically and hand-tokenized texts. While mostly successful, some issues remained around longest matches that were not valid tokens or did not allow morphological analysis. Overall, the approach demonstrated that a precise tokenizer and morphological analyzer can largely resolve the challenges of Setswana's disjunctive writing system.
Natural Language Generation from First-Order Expressions - Thomas Mathew
In this paper I discuss an approach for generating natural language from a first-order logic representation. The approach shows that a grammar definition for the natural language and a lambda-calculus-based semantic rule collection can be applied for bi-directional translation using an overgenerate-and-prune mechanism.
Finite-state morphological parsing uses finite-state transducers to parse words into their morphological components like stems and affixes. It requires a lexicon of stems and affixes, morphotactic rules describing valid morpheme combinations, and orthographic rules for spelling changes. The parser is built as a cascade of finite-state automata representing the lexicon, morphotactics and spelling rules. It maps surface word forms onto their underlying lexical representations including stems and morphological features. This allows morphological analysis of both regular and irregular forms.
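As a stand-in for the transducer cascade described above, here is a minimal sketch for English plural nouns, with a tiny stem lexicon, one morphotactic pattern (noun plus optional plural), and one orthographic rule (e-insertion after sibilants). All names and data are illustrative:

```python
# Tiny stem lexicon; a real system would encode this as a finite-state
# automaton composed with morphotactic and spelling-rule transducers.
STEMS = {"fox", "cat", "dog"}

def parse_noun(surface):
    """Map a surface form to its lexical analysis, or None."""
    for stem in STEMS:
        if surface == stem:
            return f"{stem}+N+SG"
        if surface == stem + "s" and stem[-1] not in "sxz":
            return f"{stem}+N+PL"
        if surface == stem + "es" and stem[-1] in "sxz":
            return f"{stem}+N+PL"   # e-insertion rule applied
    return None

print(parse_noun("foxes"))  # fox+N+PL
print(parse_noun("cats"))   # cat+N+PL
```

The mapping from surface form to analysis string corresponds to the surface-to-lexical direction of the transducer; running it in reverse would generate forms.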
A COMPUTATIONAL APPROACH FOR ANALYZING INTER-SENTENTIAL ANAPHORIC PRONOUNS IN...ijnlc
This paper presents a strategy and a computational model for resolving inter-sentential anaphoric pronouns in Vietnamese paragraphs composed of simple sentences. The strategy is based on grammatical features of nouns and on the focus phenomenon in Vietnamese pronoun use. In this research, we consider only nouns and pronouns that refer to human objects in the paragraph; each anaphoric pronoun appears once in a sentence and can appear in adjacent sentences. The computational model is implemented in Prolog and builds on the models of Mark Johnson and Ewan Klein, as improved by Covington and Schmitz, with Discourse Representation Theory as its theoretical background. Analysis of test results shows that this linguistically grounded approach resolves inter-sentential anaphoric pronouns in Vietnamese paragraphs well.
The document discusses context-free grammars for modeling English syntax. It introduces key concepts like constituency, grammatical relations, and subcategorization. Context-free grammars use rules and symbols to generate sentences. They consist of terminal symbols (words), non-terminal symbols (phrases), and rules to expand non-terminals. Context-free grammars can model syntactic knowledge and generate sentences in both a top-down and bottom-up manner through parsing.
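The top-down style of generation and parsing mentioned above can be sketched as a tiny recursive-descent recognizer, treating parsing as a search from the start symbol down to the words. The grammar is a toy, not taken from the document:

```python
# Toy CFG: non-terminals map to lists of possible expansions.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["the", "dog"], ["the", "cat"]],
    "VP": [["barks"], ["sees", "NP"]],
}

def recognize(symbols, words):
    """True if `symbols` can expand to exactly `words` (top-down search)."""
    if not symbols:
        return not words
    head, rest = symbols[0], symbols[1:]
    if head in GRAMMAR:  # non-terminal: try each expansion in turn
        return any(recognize(exp + rest, words) for exp in GRAMMAR[head])
    # terminal symbol: must match the next input word
    return bool(words) and words[0] == head and recognize(rest, words[1:])

print(recognize(["S"], "the dog sees the cat".split()))  # True
```

A bottom-up parser would instead start from the words and combine them into phrases; both strategies search the same space of derivations.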
Natural Language Processing (NLP) involves using computational techniques to analyze and understand human languages. Key techniques in NLP include sentiment analysis to classify emotions in text, text classification to categorize text into predefined tags or categories, and tokenization which breaks text into discrete words and punctuation. NLP is used to teach machines how to read and understand human languages by identifying relationships between words and entities. Other areas of NLP include parts of speech tagging, constituent structure analysis, and analysis of pronunciation, morphology, syntax, semantics, and pragmatics.
A general method applicable to the search for anglicisms in russian social ne...Ilia Karpov
In the process of globalization, the number of English words in other languages has rapidly increased. In automatic speech recognition systems, spell checking, tagging, and other natural language processing software, loan words are not easily recognized and should be evaluated separately. In this paper we present a corpus-based approach to the automatic detection of anglicisms in Russian social network texts. The proposed method is based on the idea of simultaneous script, phonetic, and semantic similarity between the original Latin word and its Cyrillic analogue. We used a set of transliteration, phonetic transcription, and morphological analysis methods to find candidate hypotheses, and distributional semantic models to filter them. The resulting list of borrowings, gathered from approximately 20 million LiveJournal texts, shows good intersection with a manually collected dictionary. The proposed method is fully automated and can be applied to any domain-specific area.
Full paper available at:
https://www.academia.edu/29834070/A_General_Method_Applicable_to_the_Search_for_Anglicisms_in_Russian_Social_Network_Texts
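The transliteration-matching step of such an approach can be sketched as follows. The Cyrillic-to-Latin table is abridged and the lexicon of known borrowings is invented for illustration; the actual paper combines this with phonetic and distributional filtering:

```python
# Abridged Cyrillic-to-Latin transliteration table (illustrative only).
CYR2LAT = {"м": "m", "е": "e", "н": "n", "д": "d", "ж": "zh",
           "р": "r", "а": "a", "т": "t", "и": "i", "к": "k"}

def transliterate(word):
    """Naive character-by-character transliteration to Latin script."""
    return "".join(CYR2LAT.get(ch, ch) for ch in word.lower())

# Toy lexicon of transliterated known borrowings (hypothetical).
BORROWINGS = {"menedzher", "marketing"}

def anglicism_hypotheses(tokens):
    """Tokens whose transliteration hits the borrowing lexicon."""
    return [t for t in tokens if transliterate(t) in BORROWINGS]

print(anglicism_hypotheses(["менеджер", "дерево"]))  # ['менеджер']
```

Real systems would generate several transliteration hypotheses per token and rank them, rather than relying on one deterministic mapping.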
The Application of Distributed Morphology to the Lithuanian First Accent Noun...Adrian Lin
In this paper I apply the theoretical linguistics framework of distributed morphology to analyse the Lithuanian first accent noun declension paradigm. Lithuanian is an Indo-European language that retains numerous archaic features lost in other Indo-European languages. Its nouns have 12 declensions and, like other aspects of its grammar, the declension system is very complex. As such, this topic is well suited to a distributed morphology analysis.
This document provides an overview of the Minimalist Program (MP) in linguistics. It discusses the following key points:
1. The MP aims to develop a simple linguistic model with minimal components and operations. It builds on principles of economy of derivation and representation from earlier theories like Government and Binding Theory.
2. Core concepts in the MP framework include morphosyntactic features, uninterpretable features, interpretable features, phases of derivation, probes and goals. Derivations proceed through numeration, spell-out at phases, and interpretation at the interfaces of phonetic form and logical form.
3. A phase is a syntactic domain like CP or VP that structures the derivation. Probes are
Corpus-based part-of-speech disambiguation of Persian - IDES Editor
In this paper we introduce a method for part-of-speech disambiguation of Persian texts, which uses word-class probabilities from a relatively small training corpus in order to automatically tag unrestricted Persian texts. The experiment was carried out at two levels, unigram and bigram disambiguation. Comparing the results obtained at the two levels, we show that using the immediate right context of a given word can increase the accuracy of the system to a high degree.
Parts-of-speech can be divided into closed classes and open classes. Closed classes have a fixed set of members like prepositions, while open classes like nouns and verbs are continually changing with new words being created. Parts-of-speech tagging is the process of assigning a part-of-speech tag to each word using statistical models trained on tagged corpora. Hidden Markov Models are commonly used, where the goal is to find the most probable tag sequence given an input word sequence.
Relative clause with their equivalence from English into Indonesian - imadejuliarta
This document provides an introduction and background to a study on translating relative clauses from English to Indonesian. It discusses translation theory and defines translation as transferring meaning from the source language to the target language. The document reviews previous studies on analyzing relative clauses and nominal clauses. It establishes the problems of this study, which are to identify types of relative clauses, analyze their syntactic functions, and investigate translation equivalence between the source and target languages. The significance of the study is its theoretical contribution to translation and linguistics, as well as its practical application for language learners. The scope is focused on analyzing relative clause types and translation procedures from English to Indonesian.
The document discusses Performance Grammar (PG), a psycholinguistically motivated grammar formalism that aims to describe syntactic processing phenomena in language production and comprehension. PG uses a two-stage process to generate sentences, first assembling an unordered hierarchical structure and then applying linear precedence rules to order constituents from left to right. However, a purely precedence-based approach cannot account for incremental sentence production. To address this, PG supplements precedence rules with rules that assign constituents to absolute positions within a constituent's spatial layout.
This paper performs a detailed analysis on the alignment of Portuguese contractions, based on a previously aligned bilingual corpus. The alignment task was performed manually in a subset of the English-Portuguese CLUE4Translation Alignment Collection. The initial parallel corpus was pre-processed and a decision was made as to whether the contraction should be maintained or decomposed in the alignment. Decomposition was required in the cases in which the two words that have been concatenated, i.e., the preposition and the determiner or pronoun, go in two separate translation alignment pairs (PT- [no seio de] [a União Europeia] EN- [within] [the European Union]). Most contractions required decomposition in contexts where they are positioned at the end of a multiword unit. On the other hand, contractions tend to be maintained when they occur at the beginning or in the middle of the multiword unit, i.e., in the frozen part of the multiword (PT- [no que diz respeito a] EN- [with regard to] or PT- [além disso] EN-[in addition]. A correct alignment of multiwords and phrasal units containing contractions is instrumental for machine translation, paraphrasing, and variety adaptation.
This document discusses parsing with context-free grammars. It begins by introducing context-free grammars and their use in parsing sentences. It then discusses parsing as a search problem, and presents top-down and bottom-up parsing algorithms. Top-down parsing builds trees from the root node down, while bottom-up parsing builds trees from the leaves up. Both approaches have advantages and disadvantages related to efficiency. The document also introduces probabilistic context-free grammars, which augment grammars with rule probabilities, and discusses how these can be used for disambiguation.
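The disambiguation role of rule probabilities mentioned above can be shown by scoring a parse tree under a toy PCFG: the probability of a tree is the product of the probabilities of the rules it uses. The rules and numbers are invented for illustration:

```python
# Toy PCFG: (left-hand side, right-hand side) -> rule probability.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("the", "dog")): 0.6,
    ("NP", ("the", "cat")): 0.4,
    ("VP", ("barks",)): 0.7,
    ("VP", ("sees", "NP")): 0.3,
}

def tree_prob(tree):
    """tree = (label, children); leaf children are plain word strings."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROB[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_prob(child)   # multiply in each subtree's probability
    return p

tree = ("S", [("NP", ["the", "dog"]), ("VP", ["barks"])])
print(tree_prob(tree))  # 1.0 * 0.6 * 0.7, roughly 0.42
```

Given several candidate trees for an ambiguous sentence, the parser picks the one with the highest such product.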
This document provides an overview of grammatical contrastive analysis, focusing on inflectional morphology. It discusses key grammatical categories such as aspect, case, gender, mood, number, and tense, and how they are conveyed through inflectional morphemes in some languages versus lexical means in others. The document uses examples from English and Chinese to illustrate differences in how these categories are expressed grammatically or lexically across languages. It aims to lay the groundwork for contrasting the morphological and lexical devices languages employ to convey grammatical meanings.
SYNTACTIC ANALYSIS BASED ON MORPHOLOGICAL CHARACTERISTIC FEATURES OF THE ROMA...kevig
This paper addresses the syntactic analysis of phrases in Romanian, an important process in natural language processing. We suggest a real-time solution based on the idea of using certain words or groups of words that indicate a grammatical category, together with specific endings of certain parts of the sentence. Our idea rests on characteristics of the Romanian language, in which certain prepositions, adverbs, and specific endings provide a great deal of information about the structure of a complex sentence. Such characteristics can be found in other languages too, such as French. Using a special grammar, we developed a system (DIASEXP) that can carry on a dialogue in natural language, with assertive and interrogative sentences, about a “story” (a set of sentences describing events from real life).
The document discusses realistic grammatical description and argues for maintaining a distinction between grammatical and ungrammatical sentences. It proposes that grammars developed for natural language processing should have two components: one describing grammatical sentences and one describing ungrammatical sentences observed in natural language use. Grammaticality judgements are needed to distinguish between the two types of sentences, and the document provides an operational definition of "ungrammatical" to aid such judgements.
This document presents two new approaches for aligning sentences in parallel English-Arabic corpora: mathematical regression (MR) and genetic algorithm (GA) classifiers. Feature vectors containing text features like length, punctuation score, and cognate score are extracted from sentence pairs and used to train the MR and GA models on manually aligned training data. The trained models are then tested on additional sentence pairs, achieving better results than a baseline length-based approach. The methods can be applied to any language pair by modifying the feature vector.
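The feature vectors described above can be sketched as follows, with length and punctuation features only. The names and exact form are illustrative, and the cognate-score feature is omitted:

```python
import string

def features(src, tgt):
    """Simple alignment features for a candidate sentence pair."""
    # Length feature: ratio of the shorter to the longer sentence.
    length_ratio = min(len(src), len(tgt)) / max(len(src), len(tgt))
    # Punctuation feature: do the two sides carry the same punctuation count?
    src_punct = sum(ch in string.punctuation for ch in src)
    tgt_punct = sum(ch in string.punctuation for ch in tgt)
    punct_match = 1.0 if src_punct == tgt_punct else 0.0
    return [length_ratio, punct_match]

print(features("abc.", "abcd."))  # [0.8, 1.0]
```

A classifier (the paper's MR or GA models, or any other) is then trained on such vectors from manually aligned pairs; swapping the feature set adapts the method to a new language pair.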
Phrase structure grammar models the internal structure of sentences in a hierarchical organization. It represents sentences as consisting of phrases, which are made up of words, which are made up of morphemes and phonemes. Phrase structure grammars use rewrite rules to break down syntactic structures into their constituent parts in a step-by-step manner. Deep structure represents the underlying meaning of a sentence, while surface structure is the actual form used. Transformational rules derive surface structure from deep structure.
Setswana Tokenisation and Computational Verb Morphology: Facing the Challenge...Guy De Pauw
This document discusses tokenization and computational verb morphology for Setswana, a Bantu language with a disjunctive orthography. It presents an approach that combines two tokenization transducers and a morphological analyzer to effectively tokenize Setswana text. The approach was tested on a short Setswana text and achieved 93.6% accuracy between the automatically and hand-tokenized texts. While mostly successful, some issues remained around longest matches that were not valid tokens or did not allow morphological analysis. Overall, the approach demonstrated that a precise tokenizer and morphological analyzer can largely resolve the challenges of Setswana's disjunctive writing system.
Natural Language Generation from First-Order ExpressionsThomas Mathew
In this paper I discuss an approach for generating natural language from a first-order logic representation. The approach shows that a grammar definition for the natural language and a
lambda calculus based semantic rule collection can be applied for bi-directional translation using an overgenerate-and-prune mechanism.
Finite-state morphological parsing uses finite-state transducers to parse words into their morphological components like stems and affixes. It requires a lexicon of stems and affixes, morphotactic rules describing valid morpheme combinations, and orthographic rules for spelling changes. The parser is built as a cascade of finite-state automata representing the lexicon, morphotactics and spelling rules. It maps surface word forms onto their underlying lexical representations including stems and morphological features. This allows morphological analysis of both regular and irregular forms.
A COMPUTATIONAL APPROACH FOR ANALYZING INTER-SENTENTIAL ANAPHORIC PRONOUNS IN...ijnlc
This paper presents a strategy and a computational model for solving inter-sentential anaphoric pronouns
in Vietnamese paragraphs composing simple sentences. The strategy is proposed based on grammatical
features of nouns and the focus phenomenon when using pronouns in Vietnamese. In this research, we
consider only nouns and pronouns which are human objects in the paragraph, and each anaphoric
pronoun will appear one time in one sentence and can appear in adjacent sentences. The computational
model is implemented in Prolog and based on applying and improving the models of Mark Johnson and
Ewan Klein, had been improved by Covington and Schmitz, with theoretical background of Discourse
Representation Theory.Analysis of test results shows that this approach which based on linguistic theories
helps for well solving inter-sentential anaphoric pronouns in Vietnamese paragraphs.
The document discusses context-free grammars for modeling English syntax. It introduces key concepts like constituency, grammatical relations, and subcategorization. Context-free grammars use rules and symbols to generate sentences. They consist of terminal symbols (words), non-terminal symbols (phrases), and rules to expand non-terminals. Context-free grammars can model syntactic knowledge and generate sentences in both a top-down and bottom-up manner through parsing.
Natural Language Processing (NLP) involves using computational techniques to analyze and understand human languages. Key techniques in NLP include sentiment analysis to classify emotions in text, text classification to categorize text into predefined tags or categories, and tokenization which breaks text into discrete words and punctuation. NLP is used to teach machines how to read and understand human languages by identifying relationships between words and entities. Other areas of NLP include parts of speech tagging, constituent structure analysis, and analysis of pronunciation, morphology, syntax, semantics, and pragmatics.
A general method applicable to the search for anglicisms in russian social ne...Ilia Karpov
In the process of globalization, the number of English words in other languages has rapidly increased. In automatic speech recognition, spell-checking, tagging, and other natural language processing software,
loan words are not easily recognized and should be handled
separately. In this paper we present a corpus-based approach to the automatic detection of anglicisms in Russian social network
texts. The proposed method is based on the idea of simultaneous
similarity of script, phonetics, and semantics between the original Latin word and its Cyrillic analogue. We used a set of transliteration, phonetic transcription, and morphological analysis methods to generate hypotheses, and distributional semantic models to filter them. The resulting list of borrowings, gathered from approximately 20 million LiveJournal texts, shows good overlap with a manually collected dictionary. The proposed method is fully automated and can be applied to any domain-specific area.
Full paper available at:
https://www.academia.edu/29834070/A_General_Method_Applicable_to_the_Search_for_Anglicisms_in_Russian_Social_Network_Texts
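The paper's pipeline combines transliteration, phonetic, and semantic evidence; the fragment below is only a minimal sketch of the transliteration step, with a made-up character table and a three-word stand-in lexicon rather than the authors' actual resources:

```python
# Minimal sketch (not the authors' pipeline): transliterate a Cyrillic token
# to Latin and check it against an English wordlist to flag a possible anglicism.
CYR2LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "з": "z",
    "и": "i", "й": "y", "к": "k", "л": "l", "м": "m", "н": "n", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "у": "u", "ф": "f", "х": "h",
}

ENGLISH_LEXICON = {"like", "post", "follower"}  # stand-in for a real wordlist

def transliterate(token):
    return "".join(CYR2LAT.get(ch, ch) for ch in token.lower())

def looks_like_anglicism(token):
    return transliterate(token) in ENGLISH_LEXICON

print(looks_like_anglicism("пост"))   # "пост" -> "post" -> True
```

In the full method, such string-level hypotheses would then be filtered by phonetic transcription and distributional semantic similarity.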
The Application of Distributed Morphology to the Lithuanian First Accent Noun...Adrian Lin
In this paper I apply the theoretical linguistics framework of distributed morphology to analyse the Lithuanian first accent noun declension paradigm. Lithuanian is an Indo-European language that retains numerous archaic features lost in other Indo-European languages. Its nouns have 12 declensions and, like other aspects of the grammar, the system is very complex. As such, this topic is well suited to a distributed morphology analysis.
This document provides an overview of the Minimalist Program (MP) in linguistics. It discusses the following key points:
1. The MP aims to develop a simple linguistic model with minimal components and operations. It builds on principles of economy of derivation and representation from earlier theories like Government and Binding Theory.
2. Core concepts in the MP framework include morphosyntactic features, uninterpretable features, interpretable features, phases of derivation, probes and goals. Derivations proceed through numeration, spell-out at phases, and interpretation at the interfaces of phonetic form and logical form.
3. A phase is a syntactic domain like CP or VP that structures the derivation. Probes are
Corpus-based part-of-speech disambiguation of PersianIDES Editor
In this paper we introduce a method for part-of-speech
disambiguation of Persian texts, which uses word-class
probabilities in a relatively small training corpus in order to
automatically tag unrestricted Persian texts. The experiment
has been carried out at two levels, unigram and bigram
disambiguation. Comparing the results gained from
the two levels, we show that using the immediate right context of
a given word can increase the accuracy rate of
the system to a high degree.
Parts-of-speech can be divided into closed classes and open classes. Closed classes have a fixed set of members like prepositions, while open classes like nouns and verbs are continually changing with new words being created. Parts-of-speech tagging is the process of assigning a part-of-speech tag to each word using statistical models trained on tagged corpora. Hidden Markov Models are commonly used, where the goal is to find the most probable tag sequence given an input word sequence.
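The "most probable tag sequence" search described above is typically done with the Viterbi algorithm. The sketch below uses hand-made toy probabilities rather than corpus-trained ones:

```python
# A tiny Viterbi sketch over a hand-made HMM (toy probabilities, not trained
# on a real corpus): find the most probable tag sequence for a word sequence.
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET":  {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1,  "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.6,  "NOUN": 0.3, "VERB": 0.1},
}
EMIT = {
    "DET":  {"the": 0.9, "dog": 0.0, "barks": 0.0},
    "NOUN": {"the": 0.0, "dog": 0.8, "barks": 0.2},
    "VERB": {"the": 0.0, "dog": 0.1, "barks": 0.9},
}

def viterbi(words):
    # best[t][tag] = (probability of best path ending in tag, backpointer)
    best = [{t: (START[t] * EMIT[t].get(words[0], 0.0), None) for t in TAGS}]
    for w in words[1:]:
        row = {}
        for t in TAGS:
            p, prev = max(
                (best[-1][s][0] * TRANS[s][t] * EMIT[t].get(w, 0.0), s)
                for s in TAGS
            )
            row[t] = (p, prev)
        best.append(row)
    # trace back from the best final tag
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for row in reversed(best[1:]):
        tag = row[tag][1]
        path.append(tag)
    return list(reversed(path))

print(viterbi(["the", "dog", "barks"]))  # ['DET', 'NOUN', 'VERB']
```

The bigram transition table TRANS is exactly the "immediate right/left context" information the Persian experiment above exploits.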
Relative clause with their equivalence from English into Indonesianimadejuliarta
This document provides an introduction and background to a study on translating relative clauses from English to Indonesian. It discusses translation theory and defines translation as transferring meaning from the source language to the target language. The document reviews previous studies on analyzing relative clauses and nominal clauses. It establishes the problems of this study, which are to identify types of relative clauses, analyze their syntactic functions, and investigate translation equivalence between the source and target languages. The significance of the study is its theoretical contribution to translation and linguistics, as well as its practical application for language learners. The scope is focused on analyzing relative clause types and translation procedures from English to Indonesian.
The document discusses Performance Grammar (PG), a psycholinguistically motivated grammar formalism that aims to describe syntactic processing phenomena in language production and comprehension. PG uses a two-stage process to generate sentences, first assembling an unordered hierarchical structure and then applying linear precedence rules to order constituents from left to right. However, a purely precedence-based approach cannot account for incremental sentence production. To address this, PG supplements precedence rules with rules that assign constituents to absolute positions within a constituent's spatial layout.
This paper performs a detailed analysis on the alignment of Portuguese contractions, based on a previously aligned bilingual corpus. The alignment task was performed manually in a subset of the English-Portuguese CLUE4Translation Alignment Collection. The initial parallel corpus was pre-processed and a decision was made as to whether the contraction should be maintained or decomposed in the alignment. Decomposition was required in the cases in which the two words that have been concatenated, i.e., the preposition and the determiner or pronoun, go in two separate translation alignment pairs (PT- [no seio de] [a União Europeia] EN- [within] [the European Union]). Most contractions required decomposition in contexts where they are positioned at the end of a multiword unit. On the other hand, contractions tend to be maintained when they occur at the beginning or in the middle of the multiword unit, i.e., in the frozen part of the multiword (PT- [no que diz respeito a] EN- [with regard to] or PT- [além disso] EN-[in addition]. A correct alignment of multiwords and phrasal units containing contractions is instrumental for machine translation, paraphrasing, and variety adaptation.
This document discusses parsing with context-free grammars. It begins by introducing context-free grammars and their use in parsing sentences. It then discusses parsing as a search problem, and presents top-down and bottom-up parsing algorithms. Top-down parsing builds trees from the root node down, while bottom-up parsing builds trees from the leaves up. Both approaches have advantages and disadvantages related to efficiency. The document also introduces probabilistic context-free grammars, which augment grammars with rule probabilities, and discusses how these can be used for disambiguation.
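To make the probabilistic disambiguation idea concrete, the sketch below scores two parses of a classic PP-attachment ambiguity by multiplying rule probabilities; the probabilities are illustrative, not corpus-estimated:

```python
from math import prod

# Sketch: a PCFG assigns each rule a probability; a parse tree's probability
# is the product of its rule probabilities, which lets us pick the more
# likely of two ambiguous parses. Probabilities here are illustrative.
RULE_PROB = {
    ("VP", ("V", "NP", "PP")): 0.2,   # attach PP to the verb phrase
    ("VP", ("V", "NP")):       0.5,
    ("NP", ("NP", "PP")):      0.1,   # attach PP to the noun phrase
    ("NP", ("Det", "N")):      0.6,
    ("PP", ("P", "NP")):       1.0,
}

def tree_prob(rules_used):
    return prod(RULE_PROB[r] for r in rules_used)

# "saw the man with the telescope": VP attachment vs NP attachment
vp_attach = [("VP", ("V", "NP", "PP")), ("NP", ("Det", "N")),
             ("PP", ("P", "NP")), ("NP", ("Det", "N"))]
np_attach = [("VP", ("V", "NP")), ("NP", ("NP", "PP")), ("NP", ("Det", "N")),
             ("PP", ("P", "NP")), ("NP", ("Det", "N"))]

print(tree_prob(vp_attach) > tree_prob(np_attach))  # True: VP attachment wins
```

A real parser (e.g. probabilistic CKY) would enumerate the candidate trees automatically; here the two rule multisets are written out by hand.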
This document provides an overview of grammatical contrastive analysis, focusing on inflectional morphology. It discusses key grammatical categories such as aspect, case, gender, mood, number, and tense, and how they are conveyed through inflectional morphemes in some languages versus lexical means in others. The document uses examples from English and Chinese to illustrate differences in how these categories are expressed grammatically or lexically across languages. It aims to lay the groundwork for contrasting the morphological and lexical devices languages employ to convey grammatical meanings.
SYNTACTIC ANALYSIS BASED ON MORPHOLOGICAL CHARACTERISTIC FEATURES OF THE ROMA...kevig
This paper addresses the syntactic analysis of phrases in Romanian, an important process in natural
language processing. We suggest a real-time solution based on the idea of using certain words or
groups of words that indicate grammatical category, together with specific endings of some parts of the sentence.
Our idea rests on characteristics of the Romanian language, where certain prepositions, adverbs, and
specific endings can provide a great deal of information about the structure of a complex sentence. Such
characteristics can be found in other languages too, such as French. Using a special grammar, we
developed a system (DIASEXP) that can carry on a natural-language dialogue with assertive and
interrogative sentences about a “story” (a set of sentences describing some events from real life).
The document discusses realistic grammatical description and argues for maintaining a distinction between grammatical and ungrammatical sentences. It proposes that grammars developed for natural language processing should have two components: one describing grammatical sentences and one describing ungrammatical sentences observed in natural language use. Grammaticality judgements are needed to distinguish between the two types of sentences, and the document provides an operational definition of "ungrammatical" to aid such judgements.
This document presents two new approaches for aligning sentences in parallel English-Arabic corpora: mathematical regression (MR) and genetic algorithm (GA) classifiers. Feature vectors containing text features like length, punctuation score, and cognate score are extracted from sentence pairs and used to train the MR and GA models on manually aligned training data. The trained models are then tested on additional sentence pairs, achieving better results than a baseline length-based approach. The methods can be applied to any language pair by modifying the feature vector.
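Feature vectors of the kind described above might be extracted along these lines; the feature names, the 4-letter prefix cognate heuristic, and the example pair are illustrative, not the paper's exact definitions:

```python
import string

# Sketch of a sentence-pair feature vector for alignment (lengths,
# punctuation score, cognate score); the heuristics are illustrative.
def cognate_score(src, tgt):
    """Fraction of source words sharing a 4-letter prefix with a target word."""
    tgt_prefixes = {w[:4].lower() for w in tgt.split() if len(w) >= 4}
    src_words = [w for w in src.split() if len(w) >= 4]
    if not src_words:
        return 0.0
    hits = sum(w[:4].lower() in tgt_prefixes for w in src_words)
    return hits / len(src_words)

def features(src, tgt):
    punct = set(string.punctuation)
    return {
        "len_ratio": len(src) / max(len(tgt), 1),
        "punct_diff": abs(sum(c in punct for c in src) - sum(c in punct for c in tgt)),
        "cognate": cognate_score(src, tgt),
    }

print(features("The parliament voted.", "Le parlement a voté."))
```

Such vectors, computed over candidate sentence pairs, are what the MR and GA classifiers would be trained on against manually aligned data.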
STRESS TEST FOR BERT AND DEEP MODELS: PREDICTING WORDS FROM ITALIAN POETRYkevig
In this paper we present a set of experiments carried out with BERT on a number of Italian sentences taken
from the poetry domain. The experiments are organized around the hypothesis of a very high level of difficulty in
predictability at the three levels of linguistic complexity we monitor: the lexical, syntactic, and
semantic levels. To test this hypothesis we ran the Italian version of BERT on 80 sentences, for a total of
900 tokens, mostly extracted from Italian poetry of the first half of the last century. We then alternated
canonical and non-canonical versions of the same sentence before processing them with the same DL
model, and also used sentences from the newswire domain containing similar syntactic structures. The
results show that the DL model is highly sensitive to the presence of non-canonical structures. However, DLs
are also very sensitive to word frequency and to local non-literal compositional meaning effects. This is also
apparent from the preference for predicting function over content words, and collocates over infrequent word phrases.
In the paper, we focus our attention on BERT's use of subword units for out-of-vocabulary
words.
This document discusses methods for aligning word senses between languages using probabilistic sense distributions. It proposes two approaches: 1) Using only monolingual corpora and aligning senses based on similar sense distributions between closely related languages. 2) Leveraging parallel corpora to estimate sense distribution alignments and the most probable translation for each source sense. The approaches are tested on the Europarl corpus, first ignoring and then exploiting sentence alignments. Several examples are examined to validate the sense alignments. Key aspects include using word sense disambiguation to annotate corpora, estimating sense assignment distributions, and assigning translation weights between language pairs based on relative sense frequencies.
. . . all human languages do share the same structure. More e.docxadkinspaige22
The document discusses the Universal Grammar (UG) approach to linguistics and second language acquisition. It proposes that UG posits that all human languages share the same underlying structure and are constrained by a set of innate linguistic principles and parameters. The UG approach aims to describe the mental representation of language, explain how language is acquired, and characterize what knowledge of language consists of. For second language learning, UG may fully constrain the second language grammar, or UG constraints may be impaired or not apply.
. . . all human languages do share the same structure. More e.docxShiraPrater50
. . . all human languages do share the same structure. More explicitly: they have
essentially the same primitive elements and rules of composition . . . , although
of course there may be variations, such as the obvious ones derived from the
arbitrary association between sounds and meanings . . .
(Moro, 2016, p. 15)
3.1 Introduction
In this chapter, we start to consider individual theoretical perspectives on
L2 learning in greater detail. Our first topic is the Universal Grammar (UG)
approach (the generative linguistics approach), developed by the American
linguist Noam Chomsky and numerous followers over the last few decades.
This has been the most influential linguistic theory in the field, and has
inspired a wealth of publications (for full-length treatments, see Hawkins,
2001; Herschensohn, 2000; Lardiere, 2007; Leung, 2009; Slabakova, 2016;
Snape & Kupisch, 2016; Thomas, 2004; White, 2003; Whong, Gil, & Marsden, 2013).
The main aim of linguistic theory is twofold: firstly, to characterize what
human languages are like (descriptive adequacy), and, secondly, to explain
why they are that way (explanatory adequacy). In terms of L2 acquisition,
a linguistic approach sets out to describe the evolving language produced
by L2 learners, and to explain its characteristics. UG is therefore a property
theory (as defined in Chapter 1); that is, it attempts to characterize
the underlying linguistic knowledge in L2 learners’ minds. In contrast, a
detailed examination of the learning process itself (transition theory) will
be the main concern of the cognitive approaches which we describe in
Chapters 4 and 5.
First in this chapter, we will give a broad definition of the aims of the
Chomskyan tradition in linguistic research, in order to identify the aspects
of second language acquisition (SLA) to which this tradition is most relevant.
Secondly, we will examine the concept of UG itself in some detail, and
finally we will consider its application in L2 learning research.
3 Linguistics and Language Learning: The Universal Grammar Approach
Copyright 2019 Routledge. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law. (Source: Mitchell, Rosamond; Myles, Florence; Marsden, Emma. Second Language Learning Theories, Fourth Edition.)
3.2 Why a Universal Grammar?
3.2.1 Aims of Linguistic Research
The main goals of linguistic theory, as defined by Chomsky (1986a), are to
answer three basic questions about human language:
1. What constitutes knowledge of language?
2 ...
Natural language processing (NLP) is a subfield of artificial intelligence that aims to allow computers to understand human language. NLP involves analyzing and representing text or speech at different linguistic levels for applications like question answering or machine translation. Challenges for NLP include ambiguities in language like lexical, syntactic, semantic, and anaphoric ambiguities. Common NLP tasks include part-of-speech tagging, parsing, named entity recognition, and sentiment analysis. Applications of NLP include text processing, machine translation, speech processing, and converting text to speech.
05 linguistic theory meets lexicographyDuygu Aşıklar
This document discusses the application of linguistic theories to lexicography. It reviews theories like frame semantics that can help lexicographers analyze data and produce dictionary entries. Frame semantics describes how words relate to semantic frames and contexts. The document explains key concepts for lexicographers like sense relationships, inherent properties of words, and what information is relevant for dictionary entries. It provides examples and guidelines for analyzing words using frame semantics and determining lexicographic relevance from corpus data.
Natural language processing (NLP) aims to help computers understand human language. Ambiguity is a major challenge for NLP as words and sentences can have multiple meanings depending on context. There are different types of ambiguity including lexical ambiguity where a word has multiple meanings, syntactic ambiguity where sentence structure is unclear, and semantic ambiguity where meaning depends on broader context. NLP techniques like part-of-speech tagging and word sense disambiguation aim to resolve ambiguity by analyzing context.
This document discusses contrastive analysis as a tool for comparing two languages to identify similarities and differences. It can be used to predict difficulties for language learners by examining differences between their first language (L1) and the target second language (L2). The document outlines the basic steps of contrastive analysis, including describing the phonemic inventories and comparing sounds, syntax, and other linguistic features between L1 and L2. Contrastive analysis was an early and influential theory for predicting language learning difficulties but has limitations and has been supplemented by other approaches.
This document discusses natural language processing (NLP) and provides summaries of key concepts:
1) NLP aims to help computers understand and manipulate human language to perform useful tasks by drawing on linguistics, computer science, and other fields.
2) There are four main approaches to NLP: symbolic, statistical, connectionist, and hybrid methods.
3) NLP has many applications including automatic essay scoring, requirements analysis, controlling home devices with voice commands, semantic web services, and detecting social engineering attacks in email.
This document summarizes an approach for identifying word translations from non-parallel (unrelated) English and German corpora. It uses a co-occurrence clue, assuming words that frequently co-occur in one language will have translations that co-occur frequently in the other language. The approach computes association vectors for words based on log-likelihood ratios of co-occurrence counts within a 3-word window. It determines the translation of an unknown German word by finding the most similar association vector in an English co-occurrence matrix. Evaluation on 100 test words showed an accuracy of around 72%, a significant improvement over previous methods for non-parallel texts.
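The co-occurrence clue can be sketched as follows, with made-up counts over a tiny seed lexicon and cosine similarity standing in for the paper's log-likelihood-based association vectors:

```python
from math import sqrt

# Sketch of the co-occurrence clue: represent each word by an association
# vector over a shared seed dictionary, then match a source word to the
# target word with the most similar vector. All counts here are made up.
SEED = ["water", "drink", "road"]          # vector dimensions via a seed lexicon

# co-occurrence counts of each word with the seed words (toy numbers)
GERMAN = {"Wasser": [9, 7, 0], "Strasse": [1, 0, 8]}
ENGLISH = {"water": [10, 6, 1], "street": [0, 1, 9]}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def best_translation(word):
    vec = GERMAN[word]
    return max(ENGLISH, key=lambda e: cosine(vec, ENGLISH[e]))

print(best_translation("Wasser"))   # "water"
print(best_translation("Strasse"))  # "street"
```

The real method builds these vectors from log-likelihood ratios over a 3-word window in large non-parallel corpora; the matching step is the same nearest-vector search.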
This document discusses the key concepts of generative grammar including:
- Generative grammar defines syntactic structures and generates all grammatical sentences of a language using a finite set of rules.
- Syntax is the study of how words are combined into phrases and sentences. Phrases are groupings of words headed by a lexical category.
- Sentences contain lexical categories like nouns, verbs, adjectives as well as functional categories like determiners and auxiliaries.
- Verbs select complements like objects, predicates, and clauses that are required, while adjuncts provide optional details like time and manner.
- Recursion allows categories to embed within each other, generating infinitely long phrases.
This document discusses different types of human knowledge and how language acquisition works. It begins by distinguishing between linguistic knowledge, which is knowledge about language, and non-linguistic knowledge. It notes that language has its own principles and rules that are different from other cognitive systems. Two approaches to language acquisition are described - one that views language as developing through general intellectual development, and one that argues language may require special genetic programming. The document then discusses how linguistic intuitions of native speakers can provide insights into their linguistic knowledge and syntactic judgments. It defines the difference between competence, a speaker's innate linguistic knowledge, and performance, their actual language use. Finally, it notes both universal aspects of language shared across languages but also areas of variation between
A New Approach For Paraphrasing And Rewording A Challenging TextKate Campbell
This document presents a new approach for paraphrasing and rewording challenging texts. It discusses intralingual translation (rewording or paraphrasing) and aims to show how to simplify complex English texts. The approach follows Noam Chomsky's generative–transformational model and Larson's methodology to analyze texts at various levels. It examines concepts like deep structure, surface structure, kernels/propositions, and skewing between grammar and semantics. Specific techniques for rewording are also outlined, including using synonyms, substitute words, and reciprocal words. An example analysis of rewording a passage from "Taj Mahal" is provided to demonstrate applying the paraphrasing steps.
Lectura 3.5 word normalizationintwitter finitestate_transducersMatias Menendez
This document describes a system for normalizing Spanish tweets using finite-state transducers. The system consists of three main components: 1) An analyzer that tokenizes tweets and identifies standard words, 2) Transducers that generate candidate words for out-of-vocabulary tokens by implementing linguistic models of spelling variation, 3) A statistical language model that selects the most probable sequence of words. The transducers are based on weighted finite-state automata and implement models of texting features such as logograms, shortenings, and non-standard spellings. An evaluation shows the system is effective at normalizing Spanish tweets.
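A much-reduced sketch of the candidate-generation-plus-selection idea follows, with invented spelling-variation rules and a unigram frequency list standing in for the system's weighted transducers and statistical language model:

```python
# Sketch of the normalization pipeline: generate candidates for an
# out-of-vocabulary token with simple spelling-variation rules, then pick
# the candidate a (here: unigram) language model scores highest.
LEXICON = {"que": 0.05, "porque": 0.02, "k": 0.0001, "saludos": 0.01}

def candidates(token):
    """Undo common texting shortenings (illustrative rules, not the system's)."""
    out = {token}
    if "k" in token:                      # logogram: "k" for "que"
        out.add(token.replace("k", "que"))
    if "q" in token:
        out.add(token.replace("q", "que"))
    return out

def normalize(token):
    cands = [c for c in candidates(token) if c in LEXICON]
    if not cands:
        return token                      # leave unknown tokens untouched
    return max(cands, key=lambda c: LEXICON[c])

print(normalize("k"))        # "que"
print(normalize("saludos"))  # "saludos"
```

In the full system the candidate set comes from weighted finite-state transducers and the selection step scores whole word sequences, not isolated tokens.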
Similar to SYNTACTIC ANALYSIS BASED ON MORPHOLOGICAL CHARACTERISTIC FEATURES OF THE ROMANIAN LANGUAGE
Investigation of the Effect of Obstacle Placed Near the Human Glottis on the ...kevig
Simulation of the human vocal tract model is necessary to understand the effect of different
conditions on speech generation. In natural calamities such as earthquakes, a person may swallow
soil due to the large amount of dust around. The effect of soil on the sound pressure in
the human vocal tract is investigated. The generation of speech starts from the glottis, so the soil obstacle is
positioned near it. The whole model is designed and analyzed using FEM. As the speech signal
requires a bandwidth of about 4 kHz, the investigation obtains the sound pressure
in the human vocal tract for 100-5000 Hz sound signals.
Identification and Classification of Named Entities in Indian Languageskevig
The process of identifying Named Entities (NEs) in a given document and then classifying them into
different categories of NEs is referred to as Named Entity Recognition (NER). A great effort is needed
to perform NER in Indian languages and achieve the same or higher accuracy as that obtained for
English and the European languages. In this paper, we present the results we achieved by
performing NER in Hindi, Bengali, and Telugu using a Hidden Markov Model (HMM), along with
performance metrics.
Effect of Query Formation on Web Search Engine Resultskevig
A query to a search engine is generally expressed in natural language. A query can be expressed in more than
one way without changing its meaning, depending on what the searcher has in mind at a particular moment.
The searcher's aim is to get the most relevant results regardless of how the query has been expressed. In the
present paper, we examine how search engine results change in coverage and in the similarity of the first
few results when a query is entered in two semantically equivalent but differently worded formats.
Searches were made through the Google search engine. Fifteen pairs of queries were chosen for the study.
The t-test was used, and the results were checked on the basis of the total documents found and the
similarity of the first five and first ten documents returned for a query entered in the two different
formats. It was found that the total coverage is the same but the first few results differ significantly.
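The similarity of the first few results can be quantified, for example, as the overlap of top-k result lists; the URL lists below are toy stand-ins for real engine output, and the study's actual comparison applied a t-test over such counts:

```python
# Sketch: measure how much two query formulations agree by the overlap of
# their top-k result URLs (toy lists standing in for real engine output).
def overlap_at_k(results_a, results_b, k=10):
    a, b = set(results_a[:k]), set(results_b[:k])
    return len(a & b) / k

q1 = [f"url{i}" for i in [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]
q2 = [f"url{i}" for i in [2, 1, 4, 3, 11, 12, 6, 5, 13, 14]]

print(overlap_at_k(q1, q2, k=5))   # 4 shared of 5 -> 0.8
print(overlap_at_k(q1, q2, k=10))  # 6 shared of 10 -> 0.6
```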
Investigations of the Distributions of Phonemic Durations in Hindi and Dogrikevig
Speech generation is one of the most important areas of research in speech signal processing and is now gaining serious attention. Speech is a natural form of communication in all living things. Computers with the ability to understand speech and speak with a human-like voice are expected to contribute to the development of a more natural man-machine interface. However, in order to give them functions even closer to those of human beings, we must learn more about the mechanisms by which speech is produced and perceived, and develop speech information processing technologies that can produce more natural-sounding systems. This field of study, also called speech synthesis and more prominently known as text-to-speech synthesis, originated in the mid-eighties with the emergence of DSP and the rapid advancement of VLSI techniques. To understand this field, it is necessary to understand the basic theory of speech production. Every language has a different phonetic alphabet and a different set of possible phonemes and their combinations.
For the analysis of the speech signal, we recorded five speakers in Dogri (3 male and 5 female) and eight speakers in Hindi (4 male and 4 female). For estimating the durational distributions, the mean over ten instances of each vowel for each speaker in both languages was calculated. Investigations have shown that the two durational distributions differ significantly in mean and standard deviation. The duration of a phoneme is speaker dependent. The whole investigation leads to the conclusion that almost all Dogri phonemes have shorter durations than Hindi phonemes: the durations in milliseconds of the same phonemes were longer when uttered in Hindi than when spoken by a person with Dogri as their mother tongue. There are many applications directly or indirectly related to this research. For instance, the main application may be transforming Dogri speech into Hindi and vice versa, and building on this, we can develop a speech aid to teach Dogri to children. The results may also be useful for synthesizing Dogri phonemes using the parameters of Hindi phonemes and for building large-vocabulary speech recognition systems.
May 2024 - Top10 Cited Articles in Natural Language Computingkevig
Natural Language Processing is a programmed approach to analyzing text that is based on both a set of theories and a set of technologies. This forum aims to bring together researchers who have designed and built software that will analyze, understand, and generate the languages that humans use naturally to address computers.
Effect of Singular Value Decomposition Based Processing on Speech Perceptionkevig
Speech is an important biological signal and the primary mode of communication among human beings; it is also the most natural and efficient form of exchanging information. Speech processing is an important aspect of signal processing. In this paper the linear-algebra technique of singular value decomposition (SVD) is applied to the speech signal. SVD is a technique for deriving the important parameters of a signal. The parameters derived using SVD may be further reduced by perceptual evaluation of the synthesized speech, keeping only the perceptually important parameters, so that the speech signal can be compressed without losing quality. This technique finds wide application in speech compression, speech recognition, and speech synthesis. The objective of this paper is to investigate the effect of SVD-based feature selection of the input speech on the perception of the processed speech signal. Speech in the form of the vowels /a/, /e/, /u/ was recorded from each of six speakers (3 males and 3 females). The vowels for the six speakers were analyzed using SVD-based processing, and the effect of reducing the number of singular values on the perception of the vowels resynthesized from the reduced singular values was investigated. Investigations have shown that the number of singular values can be drastically reduced without significantly affecting the perception of the vowels.
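The truncation idea can be sketched with NumPy (assuming NumPy is available); the framed signal below is synthetic, not recorded vowels:

```python
import numpy as np

# Sketch: treat framed speech samples as a matrix, keep only the largest
# singular values, and reconstruct. The frame matrix here is synthetic.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 800)
signal = np.sin(2 * np.pi * 120 * t) + 0.05 * rng.standard_normal(800)
frames = signal.reshape(40, 20)            # 40 frames of 20 samples each

U, s, Vt = np.linalg.svd(frames, full_matrices=False)

k = 4                                      # keep the 4 largest singular values
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# energy retained by the truncated representation
retained = (s[:k] ** 2).sum() / (s ** 2).sum()
print(f"kept {k}/{len(s)} singular values, energy retained: {retained:.3f}")
```

For a quasi-periodic signal the frame matrix is close to low rank, so a handful of singular values captures nearly all the energy, which is what makes SVD attractive for compression and resynthesis.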
Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models
Relevance evaluation of a query and a passage is essential in Information Retrieval (IR). Recently, numerous studies have been conducted on tasks related to relevance judgment using Large Language Models (LLMs) such as GPT-4,
demonstrating significant improvements. However, the efficacy of LLMs is considerably influenced by the design of the prompt. The purpose of this paper is to
identify which specific terms in prompts positively or negatively impact relevance
evaluation with LLMs. We employed two types of prompts: those used in previous
research and generated automatically by LLMs. By comparing the performance of
these prompts in both few-shot and zero-shot settings, we analyze the influence of
specific terms in the prompts. We have observed two main findings from our study.
First, we discovered that prompts using the term ‘answer’ lead to more effective
relevance evaluations than those using ‘relevant.’ This indicates that a more direct
approach, focusing on answering the query, tends to enhance performance. Second,
we noted the importance of appropriately balancing the scope of ‘relevance.’ While
the term ‘relevant’ can extend the scope too broadly, resulting in less precise evaluations, an optimal balance in defining relevance is crucial for accurate assessments.
The inclusion of few-shot examples helps in more precisely defining this balance.
By providing clearer contexts for the term ‘relevance,’ few-shot examples contribute to refining the relevance criteria. In conclusion, our study highlights the significance of
carefully selecting terms in prompts for relevance evaluation with LLMs.
In recent years, great advances have been made in the speed, accuracy, and coverage of automatic word
sense disambiguator systems that, given a word appearing in a certain context, can identify the sense of
that word. In this paper we consider the problem of deciding whether the same words contained in different documents are related to the same meaning or are homonyms. Our goal is to improve the estimate of the
similarity of documents in which some words may be used with different meanings. We present three new
strategies for solving this problem, which are used to filter out homonyms from the similarity computation.
Two of them are intrinsically non-semantic, whereas the other one has a semantic flavor and can also be
applied to word sense disambiguation. The three strategies have been embedded in an article document
recommendation system that one of the most important Italian ad-serving companies offers to its customers.
Genetic Approach for Arabic Part-of-Speech Tagging
With the growing number of textual resources available, the ability to understand them becomes critical.
An essential first step in understanding these sources is the ability to identify the parts-of-speech in each
sentence. Arabic is a morphologically rich language, which presents a challenge for part of speech
tagging. In this paper, our goal is to propose, improve, and implement a part-of-speech tagger based on a
genetic algorithm. The accuracy obtained with this method is comparable to that of other probabilistic
approaches.
Rule Based Transliteration Scheme for English to Punjabi
Machine transliteration has emerged as an important research area in the field of machine translation. Transliteration basically aims to preserve the phonological structure of words. Proper transliteration of named entities plays a very significant role in improving the quality of machine translation. In this paper we perform machine transliteration for the English-Punjabi language pair using a rule-based approach. We have constructed some rules for syllabification, the process of extracting or separating the syllables from words. We calculate the probabilities for named entities (proper names and locations). For words which do not fall under the category of named entities, separate probabilities are calculated by using relative frequency through a statistical machine translation toolkit known as MOSES. Using these probabilities we transliterate our input text from English to Punjabi.
Improving Dialogue Management Through Data Optimization
In task-oriented dialogue systems, the ability for users to effortlessly communicate with machines and computers through natural language stands as a critical advancement. Central to these systems is the dialogue manager, a pivotal component tasked with navigating the conversation to effectively meet user goals by selecting the most appropriate response. Traditionally, the development of sophisticated dialogue management has embraced a variety of methodologies, including rule-based systems, reinforcement learning, and supervised learning, all aimed at optimizing response selection in light of user inputs. This research casts a spotlight on the pivotal role of data quality in enhancing the performance of dialogue managers. Through a detailed examination of prevalent errors within acclaimed datasets, such as Multiwoz 2.1 and SGD, we introduce an innovative synthetic dialogue generator designed to control the introduction of errors precisely. Our comprehensive analysis underscores the critical impact of dataset imperfections, especially mislabeling, on the challenges inherent in refining dialogue management processes.
Document Author Classification Using Parsed Language Structure
Over the years there has been ongoing interest in detecting authorship of a text based on statistical properties of the text, such as by using occurrence rates of noncontextual words. In previous work, these techniques have been used, for example, to determine authorship of all of The Federalist Papers. Such methods may be useful in more modern times to detect fake or AI authorship. Progress in statistical natural language parsers introduces the possibility of using grammatical structure to detect authorship. In this paper we explore a new possibility for detecting authorship using grammatical structural information extracted using a statistical natural language parser. This paper provides a proof of concept, testing author classification based on grammatical structure on a set of “proof texts,” The Federalist Papers and Sanditon, which have been used as test cases in previous authorship detection studies. Several features extracted from the statistical natural language parser were explored: all subtrees of some depth from any level; rooted subtrees of some depth; part of speech; and part of speech by level in the parse tree. It was found to be helpful to project the features into a lower dimensional space. Statistical experiments on these documents demonstrate that information from a statistical parser can, in fact, assist in distinguishing authors.
RAG-Fusion: A New Take on Retrieval-Augmented Generation
Infineon has identified a need for engineers, account managers, and customers to rapidly obtain product information. This problem is traditionally addressed with retrieval-augmented generation (RAG) chatbots, but in this study, I evaluated the use of the newly popularized RAG-Fusion method. RAG-Fusion combines RAG and reciprocal rank fusion (RRF) by generating multiple queries, reranking them with reciprocal scores and fusing the documents and scores. Through manually evaluating answers on accuracy, relevance, and comprehensiveness, I found that RAG-Fusion was able to provide accurate and comprehensive answers due to the generated queries contextualizing the original query from various perspectives. However, some answers strayed off topic when the generated queries' relevance to the original query is insufficient. This research marks significant progress in artificial intelligence (AI) and natural language processing (NLP) applications and demonstrates transformations in a global and multi-industry context.
Performance, Energy Consumption and Costs: A Comparative Analysis of Automati...
The common practice in Machine Learning research is to evaluate the top-performing models based on their performance. However, this often leads to overlooking other crucial aspects that should be given careful consideration. In some cases, the performance differences between various approaches may be insignificant, whereas factors like production costs, energy consumption, and carbon footprint should be taken into account. Large Language Models (LLMs) are widely used in academia and industry to address NLP problems. In this study, we present a comprehensive quantitative comparison between traditional approaches (SVM-based) and more recent approaches such as LLMs (BERT family models) and generative models (GPT2 and LLAMA2), using the LexGLUE benchmark. Our evaluation takes into account not only performance parameters (standard indices), but also alternative measures such as timing, energy consumption and costs, which collectively contribute to the carbon footprint. To ensure a complete analysis, we separately considered the prototyping phase (which involves model selection through training-validation-test iterations) and the in-production phase. These phases follow distinct implementation procedures and require different resources. The results indicate that simpler algorithms often achieve performance levels similar to those of complex models (LLMs and generative models), while consuming much less energy and requiring fewer resources. These findings suggest that companies should take additional factors into account when choosing machine learning (ML) solutions. The analysis also demonstrates that the scientific community increasingly needs to consider energy consumption in model evaluations, in order to give real meaning to the results obtained using standard metrics (Precision, Recall, F1 and so on).
Evaluation of Medium-Sized Language Models in German and English Language
Large language models (LLMs) have garnered significant attention, but the definition of “large” lacks clarity. This paper focuses on medium-sized language models (MLMs), defined as having at least six billion parameters but less than 100 billion. The study evaluates MLMs regarding zero-shot generative question answering, which requires models to provide elaborate answers without external document retrieval. The paper introduces its own test dataset and presents results from human evaluation. Results show that combining the best answers from different MLMs yielded an overall correct answer rate of 82.7%, which is better than the 60.9% of ChatGPT. The best MLM achieved 71.8% and has 33B parameters, which highlights the importance of using appropriate training data for fine-tuning rather than solely relying on the number of parameters. More fine-grained feedback should be used to further improve the quality of answers. The open source community is quickly closing the gap to the best commercial models.
SYNTACTIC ANALYSIS BASED ON MORPHOLOGICAL CHARACTERISTIC FEATURES OF THE ROMANIAN LANGUAGE
International Journal on Natural Language Computing (IJNLC) Vol. 1, No. 4, December 2012
DOI: 10.5121/ijnlc.2012.1401
Bogdan Pătruţ 1,2
1 Department of Mathematics and Informatics, Vasile Alecsandri University of Bacau, Bacau, Romania
2 EduSoft Bacau, Romania
bogdan@edusoft.ro
ABSTRACT
This paper refers to the syntactic analysis of phrases in Romanian, as an important process of natural language processing. We suggest a real-time solution based on the idea of using certain words or groups of words that indicate a grammatical category, together with specific endings of some parts of the sentence. Our idea is based on characteristics of the Romanian language, in which certain prepositions, adverbs, and specific endings can provide a great deal of information about the structure of a complex sentence. Such characteristics can be found in other languages too, such as French. Using a special grammar, we developed a system (DIASEXP) that can carry out a dialogue in natural language, with assertive and interrogative sentences, about a “story” (a set of sentences describing events from real life).
KEYWORDS
Natural Language Processing, Syntactic Analysis, Morphology, Grammar, Romanian Language
1. INTRODUCTION
Before making a semantic analysis of natural language, we should define the lexicon and
syntactic rules of a formal grammar useful in generating simple sentences in the respective
language (for example, English, French, or Romanian).
We shall consider a simple grammar, having some rules for the lexicon, and some rules for the
grammatical categories. The rules for the lexicon will be of the type (1):
G → W (1)
where G is a grammatical category (or part of speech) and W is a word from a certain dictionary.
The other syntactic rules will be of the type (2):
G1 → G2 G3 (2)
meaning that the first grammatical category (G1) is formed by concatenating the other two (G2 and G3) from the right side of the arrow.
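As an illustration, rules of both types can be stored in a dictionary that maps each category to its alternative right-hand sides; any symbol absent from the dictionary is a word. The sketch below uses a hypothetical, recursion-free mini-grammar with a few Romanian words (the rule set and vocabulary are illustrative, not the paper's exact grammar), and enumerates every sentence the grammar generates:

```python
from itertools import product

# Hypothetical recursion-free grammar. Type (1) rules rewrite a category to a
# single word; type (2) rules rewrite it to a sequence of categories.
rules = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"]],
    "Det": [["un"], ["o"]],
    "N":   [["șoarece"], ["pisică"]],
    "V":   [["urăște"], ["iubește"]],
}

def expand(symbol):
    """Return every word string derivable from `symbol`."""
    if symbol not in rules:          # terminal: an actual word
        return [symbol]
    results = []
    for rhs in rules[symbol]:
        # combine the expansions of each right-hand-side symbol
        for combo in product(*(expand(s) for s in rhs)):
            results.append(" ".join(combo))
    return results

sentences = expand("S")
print(len(sentences))                              # 32 distinct word strings
print("un șoarece urăște o pisică" in sentences)   # True
```

With 2 determiners, 2 nouns and 2 verbs, the Det-N-V-Det-N pattern yields 2 × 2 × 2 × 2 × 2 = 32 sentences, which makes it easy to see how quickly a vocabulary expands the language.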
Our simple grammar is presented in Figure 1. It contains only a few Romanian words, and its syntax rules constitute a subset of the Romanian syntax rules:
Lexicon:
Det → orice (any)
Det → fiecare (every)
Det → o (a, an (fem.))
Det → un (a, an (masc.))
Pron → el (he)
Pron → ea (she)
N → bărbat (man)
N → femeie (woman)
N → pisică (cat)
N → șoarece (mouse)
V → iubește (loves)
V → urăște (hates)
A → frumoasă (beautiful)
A → deșteaptă (smart (fem.))
A → deștept (smart (masc.))
C → și (and)
C → sau (or)
Syntax:
S → NP VP
NP → Pron
NP → N
NP → Det N
NP → NP AP
AP → A
AP → AP CA
CA → C A
VP → V VP
VP → V NP
Figure 1. Example of simple grammar for Romanian language
In Figure 1, we used these notations: S = sentence, NP = noun phrase, VP = verb phrase, N =
noun, Det = determiner (article), AP = adjectival phrase, A = adjective, C = conjunction, V = verb,
CA = group made up of a conjunction and an adjective.
This grammar generates correct Romanian sentences, such as:
• Orice femeie iubește. (Every woman loves.)
• Un șoarece urăște o pisică. (A mouse hates a cat.)
• Fiecare bărbat deștept iubește o femeie frumoasă și deșteaptă. (Every smart man loves a beautiful and smart woman.)
On the other hand, this grammar rejects incorrect phrases, such as “orice iubește un bărbat” (“any loves a man”), even though such phrases contain words only from the chosen vocabulary. However, our grammar overgenerates, that is, it generates sentences that are grammatically incorrect, such as “Ea frumoasă sau deșteaptă iubește” (“She beautiful or smart loves”), “El iubește el” (“He loves he”), and “O bărbat iubește pisică” (“An man loves cat”), even though these phrases contain words from the selected lexicon.
The grammar also subgenerates, meaning that there are many sentences in Romanian that the grammar rejects, such as “Orice femeie iubește sau urăște un bărbat” (“Any woman loves or hates a man”). This phrase is correct in Romanian and contains words from the given dictionary. Likewise, the phrase “niște câini urăsc o pisică” (“some dogs hate a cat”), although syntactically correct in Romanian, is not accepted because it contains words that have not been entered into our vocabulary.
The syntactic analysis, or parsing, of a string of words may be seen as a process of searching for a derivation tree. This may be achieved either by starting from S and searching for a tree with the words of the given phrase as leaves (top-down parsing), or by starting from the words and searching for a tree with the root S (bottom-up parsing). An efficient parsing algorithm is based on dynamic programming: each time we analyze a phrase or a string of words, we store the result so that we do not have to reanalyze it later. For example, as soon as we have discovered that a string of words is an NP, we record the result in a data structure called a chart. The algorithms that perform this operation are called chart-parsers, and they run in polynomial time and polynomial space. In (Pătruţ & Boghian, 2010) we developed a chart-parser based on the Cocke, Younger, and Kasami (CYK) algorithm, presenting a Delphi application that analyzes the lexicon and the syntax of a sentence in Romanian, using a grammar in Chomsky normal form (CNF) (Chomsky, 1965).
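The chart-based dynamic programming idea can be sketched as follows. This is a minimal CYK recognizer over a toy CNF fragment, not the Delphi implementation from (Pătruţ & Boghian, 2010); the grammar, the lexicon and the diacritic-free spellings are illustrative assumptions.

```python
# Minimal CYK recognizer for a grammar in Chomsky normal form (CNF).
# The chart cell chart[span][i] stores every category that derives the
# span of words starting at position i; stored results are never recomputed.
def cyk_recognize(words, lexical, binary, start="S"):
    n = len(words)
    chart = [[set() for _ in range(n)] for _ in range(n + 1)]
    # Fill in length-1 spans from the lexical rules (Category -> word).
    for i, w in enumerate(words):
        chart[1][i] = {lhs for lhs, rhs in lexical if rhs == w}
    # Combine shorter spans bottom-up using the binary rules (A -> B C).
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            for split in range(1, span):
                for lhs, (b, c) in binary:
                    if b in chart[split][i] and c in chart[span - split][i + split]:
                        chart[span][i].add(lhs)
    return start in chart[n][0]

# Toy CNF fragment for "orice barbat iubeste o femeie" (illustrative only).
lexical = [("Det", "orice"), ("Det", "o"), ("N", "barbat"),
           ("N", "femeie"), ("V", "iubeste")]
binary = [("S", ("NP", "VP")), ("NP", ("Det", "N")), ("VP", ("V", "NP"))]

print(cyk_recognize("orice barbat iubeste o femeie".split(), lexical, binary))
```

A string such as "orice iubeste barbat", which no rule sequence derives, is rejected by the same function.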
2. USING DEFINITE CLAUSE GRAMMARS
In order for our grammar not to generate incorrect sentences, we should use the notions of gender, number, case, etc., specifying, for example, that "femeie" and "frumoasă" have feminine gender and singular number. The string “El iubește ea” is incorrect, because “ea” is in the nominative case, and we should use the preposition “pe” in order to obtain the accusative case. The correct phrase is “El iubește pe ea”, or even “El o iubește pe ea.” (“He loves her”).
If we take the case into account, the grammar is no longer context-free: it is not true that any NP is interchangeable with any other NP irrespective of the context. Nevertheless, if we want to work with a context-free grammar, we may split the category NP in two, NPN and NPA, in order to represent noun groups in the nominative (subjective) and accusative (objective) case, respectively. We shall also have to split the category Pron into two categories, PronN (including "El") and PronA (including "pe ea" (“her”), which contains the preposition “pe” in front of the pronoun “ea”) (Russell & Norvig, 2002).
Another issue concerns the agreement between the subject and the main verb of the sentence (the predicate). For example, if "Eu" (“I”) is the subject, then "Eu iubesc" (“I love”) is grammatically correct, whereas "Eu iubește" (“I loves”) is not. We shall then have to split NPN and NPA into several alternatives in order to capture the agreement. As we identify more and more distinctions, we eventually obtain an exponential number of categories.
A more efficient solution is to augment the existing grammar rules with parameters on the non-terminal categories. The categories NP and Pron receive parameters called gender, number and case; the rule for NP takes the variables gender, number and case as arguments.
This augmented formalism is called definite clause grammar (DCG), because each grammar rule may be interpreted as a definite clause in Horn logic (Pereira & Warren, 1980), (Giurca, 1997).
Using adequate predicates, a CFG (context-free grammar) rule such as S → NP VP will be written in the form of the definite clause NP(s1) VP(s2) ⇒ S(s1+s2), with the meaning that if s1 is an NP and s2 is a VP, then the concatenation of s1 with s2 is an S. DCGs allow us to see parsing as logical inference (Klein, 2005). The real benefit of the DCG approach is that we can augment the category symbols with additional arguments; for example, the rule NP(gender) → N(gender) turns into the definite clause N(gender, s1) ⇒ NP(gender, s1), meaning that if s1 is a noun with a certain gender, then s1 is also a noun phrase with the same gender. In general, we may supplement a category symbol with any number of arguments, and these arguments are parameters subject to unification, as in the usual inference with Horn clauses (Russell & Norvig, 2002).
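As a sketch of how such an augmented rule blocks overgeneration, the following toy code treats a category as a pair (symbol, features) and applies NP(gender) → Det(gender) N(gender) only when the gender features unify; the lexicon, the function names and the diacritic-free spellings are illustrative assumptions, not the formalism's actual notation.

```python
# Sketch of the augmented (DCG-style) rule NP(gender) -> Det(gender) N(gender):
# the rule applies only when the feature variables unify across its
# right-hand side, so "o barbat" ("an man") is rejected.
def unify(f1, f2):
    """Merge two feature dicts; return None on a clash (failed unification)."""
    merged = dict(f1)
    for k, v in f2.items():
        if k in merged and merged[k] != v:
            return None
        merged[k] = v
    return merged

# Illustrative Romanian lexical entries with a gender feature (assumptions).
LEX = {
    "o":      ("Det", {"gender": "fem"}),
    "un":     ("Det", {"gender": "masc"}),
    "femeie": ("N",   {"gender": "fem"}),
    "barbat": ("N",   {"gender": "masc"}),
}

def parse_np(w1, w2):
    """Apply NP(g) -> Det(g) N(g): succeeds only if the genders unify."""
    c1, f1 = LEX[w1]
    c2, f2 = LEX[w2]
    if c1 != "Det" or c2 != "N":
        return None
    feats = unify(f1, f2)
    return ("NP", feats) if feats is not None else None

print(parse_np("o", "femeie"))   # an NP with feminine gender
print(parse_np("o", "barbat"))   # None: gender clash, so "o barbat" is rejected
```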
Even with the improvements brought by DCG, incorrect sentences may still be overgenerated. In order to deal with the correct verbal groups in some situations, we shall have to split the category V into subcategories, one for verbs with no object, one for verbs with a single object, and so on. Thus, we shall have to specify which expressions (groups) may follow each verb, that is, realize a subcategorization of that verb by its list of objects. An object is a compulsory expression that follows the verb within a verbal group.
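A minimal sketch of such a subcategorization table, with assumed verbs and object frames (the entries below are illustrative, not taken from the system's lexicon):

```python
# Sketch of verb subcategorization: each verb lists the object frames
# (lists of required objects) that may follow it.
SUBCAT = {
    "doarme":  [[]],                         # "sleeps": no object allowed
    "iubeste": [["Dir_obj"]],                # "loves": one direct object
    "da":      [["Dir_obj", "Indir_obj"]],   # "gives": direct + indirect object
}

def frame_ok(verb, objects):
    """Check that the observed object list matches one allowed frame."""
    return list(objects) in SUBCAT.get(verb, [])

print(frame_ok("iubeste", ["Dir_obj"]))   # the frame matches
print(frame_ok("doarme", ["Dir_obj"]))    # rejected: "doarme" takes no object
```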
Because the syntactic analysis of a phrase raises many problems, and because processing the complex situations found in Romanian texts describing real-life situations is time-consuming, we decided to use some morphological characteristics of Romanian words that can help determine the parts of the sentence.
3. THE IDEA FOR SYNTACTIC ANALYSIS BASED ON MORPHOLOGICAL
CHARACTERISTIC FEATURES OF THE LANGUAGE
As we explained in the introduction, the classic approaches to syntactic analysis use lexicons (dictionaries), CFG or DCG grammars, and chart-parsers, such as the one we developed in (Pătruţ & Boghian, 2010).
In the previous sections, we noted the following:
1. Using a CFG grammar, syntactically correct sentences (in Romanian) are accepted, but so are incorrect ones.
2. The power of an analysis system (for example, a chart-parser) based on such a grammar depends on the extent of the vocabulary used.
3. The power of a chart-parser can be improved using DCG grammars. This implies extending the set of syntactic rules with many new rules that use special variables (such as gender, case, number, etc.).
Regarding the first and the last issues, let us assume that the system will not be required to analyze incorrect sentences; we therefore consider the grammar satisfactory. The second problem could be solved by strongly enriching the vocabulary, which would require elaborating and implementing data structures and searching techniques as efficient as possible; in some cases, however, these will not function in real time. Of course, the main problem would be to write as comprehensive a grammar as possible, close to the linguistic realities of Romanian morphology, taking into consideration the diversity of forms (and
meanings) that a sentence can take in the Romanian language. However, the extent and complexity of such a grammar slow down the analysis, so it will not run in real time in all cases.
In this section we suggest a real-time solution, based on the idea of using:
• some words or groups of words that indicate a grammatical category;
• some specific endings of the inflected words that indicate certain parts of sentence.
Our idea is based on some characteristics of the Romanian language, where certain prepositions or specific endings can provide a lot of information about the structure of a complex phrase. Such characteristics can be found in other languages, too, such as French. Using these ideas, we developed a system that uses a special grammar, which we explain below.
The morphology of the Romanian language allows the development of a special kind of syntactic analysis that makes full use of certain characteristics of words when they are inflected (declined or conjugated) or appear in different hypostases.
With a view to "understanding" a Romanian sentence, that is, to finding its constituent parts, which would also favor the real-time translation of the sentence into another language, we suggest a simple solution based on patterns:
• there is a minimal vocabulary of key words: prefixes¹, linking words, endings;
• a relatively limited grammar is built, in which the terminals are words, either with given endings or from another given vocabulary;
• the user is asked to respect some (relatively few) restrictions on sentence word order;
• the user is assumed to be well meaning.
Figure 2 presents the general scheme of such an analyzer (Pătruţ & Boghian, 2012).
In Figure 2, we consider the concrete case of the sentence "Copii cei cuminţi au recitat o poezie părinţilor, în faţa şcolii." ("The good children recited a poem to their parents, in front of the school").
In stage 2, the system identifies predicate nouns and adjectives (introduced by pronouns or prepositions), "inarticulate" direct or indirect objects (used without an article, introduced by articles, prepositions and prepositional phrases, respectively), and adverbials; in stage 3, "articled" direct and indirect objects are identified, etc.
Following the logic of the processes within the analyzer, we notice that the final result (which is correct in our case, up to some additional detailing) is approximate, because the system is based on the following observations:
• In Romanian, the subject usually precedes the predicate: "copiii" ("the children") before "au recitat" ("recited");
• There are some words (or groups of words), which we call prefixes or indicators, that introduce adverbials: "în faţa..." ("in front of") = place adverbial; "fiindcă..." ("because"), "din cauză că..." ("because of...") = cause adverbial; "pentru a..." ("for") = purpose adverbial;
• The predicate nouns, the adjectives and the objects (direct and indirect) can be introduced by indicating prefixes: "lui..." (Ion, Gheorghe, etc.) ("to...") = indirect object (or predicate noun); "o..." ("fată" ("girl"), "pisică" ("cat"), etc.) ("a…") = direct object, if these have not already been identified as the subject. We may thus assume that one says "un băiat citeşte o carte" ("a boy is reading a book") and not "o carte (e ceea ce) citeşte un băiat" or "o carte este citită de un băiat" ("a book (is that which) a boy is reading" or "a book is being read by a boy"); therefore the subject precedes the object;
• Some endings offer sufficient information about the nature of the respective words: in the word ‘părinţilor’ the ending ‘-ilor’ indicates an indirect object², just as ‘-ei’ in ‘fetei’ or ‘mamei’ indicates the same type of object, whereas ‘-ul’ in ‘băiatul’ or ‘creionul’ indicates a direct object ("articled").
¹ By prefixes we understand words or groups of words whose role is to indicate the "sense" of the following words in the phrase.
Figure 2. Stages of the analysis
We thus notice that our sentence, recognized by an analyzer of the type presented above, is structurally complex enough compared to the famous "orice bărbat iubeşte o femeie" ("any man loves a woman"), given as an example for classic chart-parsers.
It should also be noted that the sentence "I o i pe M, deoarece M e f" ("J l M, because M i b") could be recognized by the analyzer on the basis of the pattern:
• <subiect> <predicat> pe <complement direct>, deoarece <circumstanţial de cauză>
• (<subject> <predicate> [on] <direct object>, because <cause adverbial>)
Thus, the above "sentence", although meaningless in Romanian, would fall into the same category as "Ion o iubeşte pe Maria, deoarece Maria este frumoasă" ("John loves Mary because Mary is beautiful"), a category represented by the pattern mentioned above.
² Of course, “părinţilor” could be an indirect object as well as a predicative noun, as in the sentence "copiii cei cuminţi au recitat poeziile părinţilor, în faţa şcolii" ("the good children recited the poems of their parents in front of the school"); therefore, frequently not even the grammar can resolve the system’s ambiguities.
4. CREATING AND CONSULTING A DATABASE
Of course, by introducing more such sentences (phrases), the user can create a database table with the structure of a sentence; each entry contains the fields: subject, predicate, predicative noun, direct object, indirect object, place adverbial, etc. The detailed structure of the table is presented in the section on the DIASEXP system that we developed.
Consulting such a database is done through Romanian interrogative sentences (phrases), for example:
• Cine <predicat> <complement direct>? ("Cine citeşte cartea?") (Who <predicate> <direct object>? ("Who is reading the book?"))
in order to find out the <subject>, using a search engine based on pattern matching, in which the search clues are the predicate ("is reading") and the direct object ("book").
Although the precision of the analysis (and hence of the answers to the questions) depends on how well the user respects the word-order restrictions and the vocabulary taken into consideration, the system we have built works in real time and can be used successfully. Moreover, the system can enrich its vocabulary through learning, so that, starting from only about 300 initial words and endings, it can cover a wide range of situations.
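A minimal sketch of this consultation step, assuming records with the field names described above and purely illustrative data:

```python
# Sketch of consulting the sentence database: each record stores the parts
# of one analyzed assertion; a "Cine <predicate> <direct object>?" question
# is answered by matching on the predicate and the direct object.
records = [
    {"subject": "Ion",   "predicate": "citeste", "dir_obj": "cartea"},
    {"subject": "Maria", "predicate": "citeste", "dir_obj": "ziarul"},
]

def who(predicate, dir_obj):
    """Answer 'Cine <predicate> <dir_obj>?' from the stored records."""
    return [r["subject"] for r in records
            if r["predicate"] == predicate and r["dir_obj"] == dir_obj]

print(who("citeste", "cartea"))   # the subjects whose record matches
```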
5. A GENERATIVE GRAMMAR MODEL FOR SYNTACTIC ANALYSIS
We present below (Figure 3) the grammar used by our system in the syntactic analysis:
1. Sentence → Subject Predicate | Subject Predicate Other_parts_of_sent
2. Other_parts_of_sent → Part_of_sentence | Part_of_sentence Other_parts_of_sent
3. Part_of_sentence → Adverbial | Object
4. Subject → Simple_subject | Simple_subject Attrib_sub
5. Object → Direct_object | Indirect_object
6. Adverbial → Where | When | How | Goal | Why
7. Direct_object → Dir_obj | Dir_obj Attrib_do
8. Indirect_object → Indir_obj | Indir_obj Attrib_io
9. Attrib_sub → Attribute
10. Attrib_do → Attribute
11. Attrib_io → Attribute
12. Attribute → Adjective | Possession_word | Pref_attrib Words T | Pref_attrib Word | Word + T_pos | Word + ‘ând’ Words T
13. Dir_obj → Word + T_do | Pref_do Word
14. Indir_obj → Word + T_io | Pref_io Word
15. Where → Adv_where | Pref_where Words T
16. When → Adv_when | Pref_when Words T
17. How → Adv_how | Pref_how Words T
18. Goal → Pref_goal Words T
19. Why → Pref_why Words T
20. Pref_attrib → ‘al’ | ‘a’ | ‘ai’ | ‘ale’ | ‘cu’ | ‘de’ | ‘din’ | ‘cel’ | ‘cea’ | ‘cei’ | ‘cele’ | ‘ce’ | ‘care’ ...
21. Pref_do → ‘pe’
22. Pref_io → ‘lui’ | ‘de’ | ‘despre’ | ‘cu’ | ‘cui’ ...
23. Pref_where → ‘la’ | ‘în’ | ‘din’ | ‘de la’ | ‘lângă’ | ‘în spatele’ | ‘în josul’ | ‘spre’ | ‘unde’ | ‘aproape de’ | ‘către’ ...
24. Pref_when → ‘când’ | ‘peste’ | ‘pe’ | ‘înainte de’ | ‘după’ ...
25. Pref_how → ‘altfel decât’ | ‘astfel ca’ | ‘în felul’ | ‘cum’ | ‘aşa ca’ | ‘în modul’ ...
26. Pref_goal → ‘pentru’ | ‘pentru ca să’ | ‘în vederea’ | ‘cu scopul’ | ‘pentru ca’ ...
27. Pref_why → ‘căci’ | ‘pentru că’ | ‘deoarece’ | ‘fiindcă’ | ‘întrucât’ ...
28. Adjective → ‘mare’ | ‘mic’ | ‘bun’ | ‘frumos’ | ‘înalt’ ...
29. Possession_word → ‘meu’ | ‘tău’ | ‘lor’ | ‘tuturor’ | ‘nimănui’ ...
30. Adv_where → ‘aici’ | ‘acolo’ | ‘dincolo’ | ‘oriunde’ ...
31. Adv_when → ‘acum’ | ‘atunci’ | ‘vara’ | ‘iarna’ | ‘luni’ | ‘totdeauna’ | ‘seara’ ...
32. Adv_how → ‘aşa’ | ‘bine’ | ‘frumos’ | ‘oricum’ | ‘greu’ ...
33. T_pos → ‘lui’ | ‘ei’ | ‘ilor’ | ‘elor’
34. T_do → ‘a’ | ‘ul’ | ‘ele’ | ‘ile’
35. T_io → ‘ei’ | ‘ului’ | ‘elor’ | ‘ilor’
36. T → ‘,’ | ‘.’
37. Simple_subject → Words
38. Predicate → Words | Forms_of_to_be
39. Forms_of_to_be → ‘este’ | ‘e’ | ‘sunt’ | ‘ești’ ...
40. Dir_obj → Words
41. Indir_obj → Words
42. Adjective → ‘mare’ | ‘mic’ | ‘frumos’ | ‘roșu’ ...
43. Adv_where → ‘aici’ | ‘acolo’ | ‘sus’ | ‘jos’ ...
44. Adv_when → ‘dimineața’ | ‘miercuri’ | ‘iarna’ | ‘atunci’ ...
45. Adv_how → ‘așa’ | ‘bine’ | ‘repede’ | ‘frumos’ ...
46. Words → Word | Word Words | Words + ‘,’ Words
47. Word → Character | Word + Character
48. Character → ‘A’ | ‘B’ | ... | ‘Z’ | ‘a’ | ... | ‘z’ | ‘0’ | ‘1’ | ... | ‘9’ | ‘-’ ...
(The symbols of the grammar are written using Romanian words.)
Figure 3. A grammar for Romanian, based on morphological characteristics
It should be noted that the adverbs used are the "general" ones, and the adjectives are those most frequently used in common speech. Also, note that by "+" we denote the concatenation of two words: ‘fete’ + ‘lor’ = ‘fetelor’. In some cases this concatenation is influenced by a phonemic alternation, as in ‘fată’ + ‘ei’ = ‘fetei’, where we have the alternation a→e (see (Pătruţ, 2010) for details).
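The "+" concatenation can be sketched as follows; the alternation rule below is a simplified assumption covering only this final-'a' case, and diacritic-free spellings stand in for ‘fată’/‘fetei’.

```python
# Sketch of word + ending concatenation with a phonemic alternation a -> e,
# as in 'fata' + 'ei' -> 'fetei' (simplified, illustrative rule only).
def attach_ending(word, ending):
    if ending == "ei" and word.endswith("a"):
        # drop the final 'a' and apply the alternation a -> e
        # in the last remaining 'a' of the stem (simplification)
        word = word[:-1]
        i = word.rfind("a")
        if i != -1:
            word = word[:i] + "e" + word[i + 1:]
    return word + ending

print(attach_ending("fete", "lor"))  # fetelor (plain concatenation)
print(attach_ending("fata", "ei"))   # fetei (with the a -> e alternation)
```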
6. DIASEXP
Using the grammar from the previous section, we developed the DIASEXP system. DIASEXP has a simple text interface, in which the user can introduce different phrases describing knowledge about real-life events. This collection of phrases is recorded as a “story”. Each phrase of the story is analyzed by the system, which automatically detects the parts of the sentence and adds them to a table with the following fields:
1. Subject – this will represent the simple subject of the sentence (a noun or a pronoun) (see
rules 4 and 37 in Figure 3);
2. Attrib_sub – this will be the attribute of the subject (it can be, for example an adjective,
see rules 9 and 12 in the same figure);
3. Predicate – the predicate of the sentence represents the main action of the assertive sentence; the predicate can be a normal verb or the copulative verb “a fi” (“to be”), which will be followed by a predicative noun (nume predicativ) – see rules 38 and 39;
4. Dir_obj – the direct object or the predicative noun (see rules 7, 13, 21, and 34);
5. Attribute_do – this will be the attribute of the direct object (see rules 7, 10, 12, and 28)
6. Indir_obj – the indirect object (see rules 8, 14, 22, and 35)
7. Attribute_io – the attribute of the indirect object (the attributes are represented by adjectives or by some prepositional phrases) – see rules 8, 11, 12, and 28;
8. Where – this will be the place adverbial; see rules 3, 6, 15, 23, and 30;
9. When – this will be the time adverbial; see rules 3, 6, 16, 24, and 31;
10. How – this will be the manner adverbial; see rules 3, 6, 17, 25, and 32;
11. Goal – this will be the goal adverbial; it will be the answer at the question “Pentru ce…”
(“For what…”); see rules 3, 6, 18, and 26;
12. Why – this will be the cause adverbial; see rules 3, 6, 19, and 27.
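The record structure above can be sketched as follows; the field names mirror the list, with None marking a part of sentence absent from the phrase (a minimal sketch, not the Pascal implementation):

```python
# Sketch of one database record with the twelve fields listed above;
# None marks a part of sentence that is not present in the phrase.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SentenceRecord:
    subject: Optional[str] = None
    attrib_sub: Optional[str] = None
    predicate: Optional[str] = None
    dir_obj: Optional[str] = None
    attribute_do: Optional[str] = None
    indir_obj: Optional[str] = None
    attribute_io: Optional[str] = None
    where: Optional[str] = None
    when: Optional[str] = None
    how: Optional[str] = None
    goal: Optional[str] = None
    why: Optional[str] = None

# "Elena este sociabila mereu" (Helen is always sociable):
r = SentenceRecord(subject="Elena", predicate="este",
                   dir_obj="sociabila", when="mereu")
print(r.subject, r.when)
```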
When DIASEXP is not sure which part of the sentence a word is, it asks the user, offering two or three variants, and the user indicates the correct one. The next time, in a similar situation, DIASEXP knows the answer and does not ask again. Using commas consistently is very helpful for avoiding ambiguities.
After entering some assertive sentences, the user can enter interrogative sentences. When the user asks DIASEXP something, it answers by consulting the assertive sentences it knows up to that moment.
Below (Table 1) you can find an example of a “story”, from a dialogue between DIASEXP and a human user. The assertive sentences are prefixed by A, the interrogative sentences by I, and DIASEXP’s answers by R. The English translation accompanies each sentence.
Table 1. An example of a dialogue with DIASEXP (A = assertions, I = interrogations, R = answers)
A: Elena este frumoasă, deoarece are ochi frumoși. (Helen is beautiful, because she has beautiful eyes.)
A: Elena este frumoasă, deoarece are păr lung. (Helen is beautiful, because she has long hair.)
A: Elena este frumoasă, căci e suplă. (Helen is beautiful, as she is slim.)
A: Elena este plăcută, întrucât e sociabilă. (Helen is enjoyable, as she is sociable.)
A: Elena e plăcută, deoarece e harnică. (Helen is enjoyable, because she is hard-working.)
A: Elena este sociabilă mereu. (Helen is always sociable.)
A: Elena e prietenoasă cu oamenii inteligenți. (Helen is amiable with intelligent people.)
A: Elena este prietena lui Adrian. (Helen is Adrian’s girlfriend.)
A: Adrian o iubește pe Elena, căci e frumoasă și harnică. (Adrian loves Helen, because she is beautiful and hard-working.)
A: Adrian nu iubește altă fată. (Adrian doesn’t love another girl.)
A: Adrian va dărui Elenei o floare mâine, deoarece o iubește. (Adrian will give Helen a flower tomorrow, because he loves her.)
A: Elena va fi bucuroasă de floare. (Helen will be happy about the flower.)
A: Părinții Elenei vorbesc despre Adrian. (Helen’s parents talk about Adrian.)
A: Părinții Elenei îl vor invita pe Adrian la cină, politicos. (Helen’s parents will politely invite Adrian to dinner.)
A: Părinții Elenei îl plac pe Adrian, deoarece este băiat bun. (Helen’s parents like Adrian, because he is a good boy.)
A: Adrian este băiat bun, deoarece este student. (Adrian is a good guy, because he is a student.)
A: Adrian, cel bun, este student la facultate, fiindcă a învățat mult. (Adrian, the good guy, is a student in college, because he learned a lot.)
A: Adrian, studentul, este pasionat de informatică. (Adrian, the student, is passionate about computer science.)
A: Adrian este student la informatică. (Adrian is a student of computer science.)
A: Elena, cea frumoasă, este elevă silitoare la liceu. (Helen, the beautiful one, is a diligent student in high school.)
A: Elena va citi azi bine lecția de istorie, pentru ca să ia notă mare. (Helen will read the history lesson well today, in order to get a high grade.)
A: Elena iubește pe Adrian. (Helen loves Adrian.)
A: Elena iubește pe părinții ei. (Helen loves her parents.)
A: Elena s-ar căsători cu Adrian, fiindcă îl iubește. (Helen would marry Adrian, because she loves him.)
A: Adrian va pleca repede, astăzi, la bunicii lui, pentru ca să îi vadă. (Adrian will go quickly today to his grandparents, in order to see them.)
I: Cum este Elena? (How is Helen?)
R: Elena este frumoasă. Elena este plăcută. Elena este elevă. (Helen is beautiful. Helen is enjoyable. Helen is a student.)
I: Pe cine iubește Adrian? (Whom does Adrian love?)
R: Adrian iubește pe Elena. Adrian nu iubește altă fată. (Adrian loves Helen. Adrian doesn’t love another girl.)
I: Unde va pleca Adrian? (Where will Adrian go?)
R: Adrian va pleca la bunicii lui. (Adrian will go to his grandparents.)
I: Când va pleca Adrian la bunicii lui? (When will Adrian go to his grandparents?)
R: Adrian va pleca astăzi la bunicii lui. (Adrian will go today to his grandparents.)
I: De ce este plăcută Elena? (Why is Helen enjoyable?)
R: Elena este plăcută întrucât e sociabilă. (Helen is enjoyable because she is sociable.)
R: Elena este plăcută deoarece e harnică. (Helen is enjoyable because she is hard-working.)
I: Cui va dărui Adrian? (To whom will Adrian give?)
R: Adrian va dărui Elenei. (Adrian will give to Helen.)
I: Ce va dărui Adrian Elenei? (What will Adrian give to Helen?)
R: Adrian va dărui Elenei o floare. (Adrian will give Helen a flower.)
The “story” represented by the assertions from Table 1 is stored in the database table as shown in Table 2. Note that the system correctly detected the parts of sentence for every assertion. In the table, blank fields correspond to those parts of sentence that are not present in that phrase.
Table 2. The table of the database, with the records of the story from Table 1 (only the non-empty fields of each record are shown)

1. Subject: Elena; Predicate: este; Dir_obj: frumoasă; Why: deoarece are ochi frumoși
2. Subject: Elena; Predicate: este; Dir_obj: frumoasă; Why: deoarece are păr lung
3. Subject: Elena; Predicate: este; Dir_obj: frumoasă; Why: căci e suplă
4. Subject: Elena; Predicate: este; Dir_obj: plăcută; Why: întrucât e sociabilă
5. Subject: Elena; Predicate: e; Dir_obj: plăcută; Why: deoarece e harnică
6. Subject: Elena; Predicate: este; Dir_obj: sociabilă; When: mereu
7. Subject: Elena; Predicate: este; Dir_obj: prietenoasă; Indir_obj: cu oamenii; Attribute_io: inteligenți
8. Subject: Elena; Predicate: este; Dir_obj: prietena; Indir_obj: lui Adrian
9. Subject: Adrian; Predicate: o iubește; Dir_obj: pe Elena; Why: căci e frumoasă și harnică
10. Subject: Adrian; Predicate: nu iubește; Dir_obj: fată; Attribute_do: altă
11. Subject: Adrian; Predicate: va dărui; Dir_obj: o floare; Indir_obj: Elenei; When: mâine; Why: deoarece o iubește
12. Subject: Elena; Predicate: va fi; Dir_obj: bucuroasă; Indir_obj: de floare
13. Subject: părinții; Attrib_sub: Elenei; Predicate: vorbesc; Indir_obj: despre Adrian
14. Subject: părinții; Attrib_sub: Elenei; Predicate: îl vor invita; Dir_obj: pe Adrian; Where: la ei; How: politicos
15. Subject: părinții; Attrib_sub: Elenei; Predicate: îl plac; Dir_obj: pe Adrian; Why: deoarece este băiat bun
16. Subject: Adrian; Predicate: este; Dir_obj: băiat; Attribute_do: bun; Why: deoarece este student
17. Subject: Adrian; Attrib_sub: cel bun; Predicate: este; Dir_obj: student; Where: la facultate; Why: fiindcă a învățat mult
18. Subject: Adrian; Attrib_sub: studentul; Predicate: este; Dir_obj: pasionat; Indir_obj: de informatică
19. Subject: Elena; Attrib_sub: cea frumoasă; Predicate: este; Dir_obj: elevă; Attribute_do: silitoare; Where: la liceu
20. Subject: Elena; Predicate: iubește; Dir_obj: pe Adrian
21. Subject: Elena; Predicate: iubește; Dir_obj: pe părinții; Attribute_do: ei
22. Subject: Elena; Predicate: s-ar căsători; Indir_obj: cu Adrian; Why: fiindcă îl iubește
23. Subject: Adrian; Predicate: va pleca; When: astăzi; Where: la bunicii lui; How: repede; Goal: pentru ca să îi vadă
The system was developed in the Pascal programming language and was tested by our team in different situations. The graph in Figure 4 shows the results of analyzing 4 stories, after entering 20, 50, 100, and 300 sentences, respectively. The average of good results is over 80%.
Figure 4. Results of testing DIASEXP on different stories
7. CONCLUSIONS
CFG and DCG are types of generative grammars used in the syntactic analysis of natural language phrases. Long or complex phrases can sometimes be a problem for classic chart-parsers. The characteristic features of the Romanian language related to the morphology of words, or to the way in which adverbials are formed, can be successfully used in syntactic analysis by a system based on patterns. Such a system works in real time and does not need to store the whole dictionary of the Romanian language; it uses a small dictionary of “prefixes” (prepositions, adverbs, etc.), together with some endings and linking words.
ACKNOWLEDGEMENTS
The authors would like to thank professors Dan Cristea, Grigor Moldovan and Ioan Andone for their advice and observations during the research activities.
REFERENCES
[1] Chomsky, N. (1965). Aspects of the Theory of Syntax. MIT Press, 1965. ISBN 0-262-53007-4
[2] Cristea, Dan (2003). Curs de Lingvistică computaţională, Faculty of Informatics, "Alexandru Ioan
Cuza" University of Iaşi, Romania.
[3] Giurcă, Adrian (1997). Procesarea limbajului Natural. Analiza automată a frazei. Aplicaţii, University
of Craiova, Faculty of Mathematics-Informatics, Romania.
[4] Klein, E. (2005). Context Free Grammars. Retrieved from
www.inf.ed.ac.uk/teaching/courses/icl/lectures/2006/cfg-lec.pdf at 1st December 2012
[5] Pătruţ Bogdan (2010). “ADX — Agent for Morphologic Analysis of Lexical Entries in a Dictionary”.
BRAIN. Broad Research in Artificial Intelligence and Neuroscience, vol. 1 (1), 2010, ISSN 2067-
3957.
[6] Pătruţ, Bogdan & Boghian, Ioana (2012). Natural Language Processing: Syntactic Analysis, Lexical
Disambiguation, Logical Formalisms, Discourse Theory, Bivalent Verbs, Germany, Munich: AVM –
Akademische Verlagsgemeinschaft München, ISBN 978-3869242071.
[7] Pătruţ, Bogdan & Boghian, Ioana (2010). "A Delphi Application for the Syntactic and Lexical
Analysis of a Phrase Using Cocke, Younger, and Kasami Algorithm" in BRAIN: Broad Research in
Artificial Intelligence and Neuroscience, Vol. 1, Issue 2, EduSoft: 2010.
[8] Pereira, F. & D. Warren (1986). “Definite clause grammars for language analysis”. Readings in
natural language processing, pp. 101-124, USA, San Francisco: Morgan Kaufmann Publishers Inc.,
ISBN:0-934613-11-7.
[9] Russell, Stuart & Norvig, Peter (2002). Artificial Intelligence: A Modern Approach, 2nd International Edition, USA, Menlo Park: Addison Wesley.
Author
Bogdan Pătruţ
Bogdan Pătruţ is an associate professor of computer science at Vasile Alecsandri University of Bacau, Romania, with a PhD in computer science and a PhD in accounting. His domains of interest and research are natural language processing, multi-agent systems, and computer science applied in the social, economic and political sciences. He has published papers in international journals in these domains. Dr. Pătruţ is also the author or editor of several books on programming, algorithms, artificial intelligence, interactive education, and social media.