This paper presents a strategy and a computational model for resolving inter-sentential anaphoric pronouns
in Vietnamese paragraphs composed of simple sentences. The strategy is based on the grammatical
features of nouns and on the focus phenomenon in Vietnamese pronoun use. In this research, we
consider only nouns and pronouns that refer to humans in the paragraph; each anaphoric
pronoun appears once per sentence and may appear in adjacent sentences. The computational
model is implemented in Prolog and builds on the models of Mark Johnson and
Ewan Klein, as improved by Covington and Schmitz, with Discourse
Representation Theory as its theoretical background. Analysis of test results shows that this approach, which is based on linguistic theories,
resolves inter-sentential anaphoric pronouns in Vietnamese paragraphs well.
Abstract: The use of regular expressions to search text is a well-known and well-understood technique. Regular expressions are generic representations of a string or a collection of strings. Regular expressions (regexps) are one of the most useful tools in computer science. NLP, as an area of computer science, has greatly benefited from regexps: they are used in phonology, morphology, text analysis, information extraction, and speech recognition. This paper gives the reader a general review of the use of regular expressions, illustrated with examples from natural language processing. In addition, it discusses different approaches to regular expressions in NLP. Keywords: Regular Expression, Natural Language Processing, Tokenization, Longest common subsequence alignment, POS tagging
----------------------------
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
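As a small illustration of the regular-expression tokenization named in the keywords above, the following sketch splits a sentence into word, number and punctuation tokens. The pattern `TOKEN_RE` and the function `tokenize` are hypothetical names chosen for this example; they are not taken from the paper, and a real tokenizer would need language-specific rules.

```python
import re

# Hypothetical token pattern (an assumption, not from the paper):
#   1. numbers with an optional decimal part,
#   2. words with optional internal hyphens or apostrophes,
#   3. any single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|\w+(?:[-']\w+)*|[^\w\s]")


def tokenize(text):
    """Return all non-overlapping tokens, scanning left to right."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(tokenize("Regexps aren't magic: they cost $3.50."))
```

The order of the alternatives matters: the number branch comes first so that a value such as 3.50 stays one token instead of being split at the decimal point.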
Semantic Peculiarities of Antonyms Based on the Works by I. Yusupov (YogeshIJTSRD)
The article describes and comparatively analyzes the stylistic, lexical and semantic features of antonyms in English and Karakalpak. It also describes some peculiarities of antonyms based on the work of the Karakalpak writer I. Yusupov. Semantic, comparative and descriptive methods of analysis were used to identify the differences between antonyms in these languages. Furthermore, the article suggests some ways and techniques of teaching antonyms that can be effective in foreign language teaching. Bayrieva Maryam Jangabaevna "Semantic Peculiarities of Antonyms (Based on the Works by I. Yusupov)" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5 | Issue-3, April 2021, URL: https://www.ijtsrd.com/papers/ijtsrd41079.pdf Paper URL: https://www.ijtsrd.com/humanities-and-the-arts/education/41079/semantic-peculiarities-of-antonyms-based-on-the-works-by-i-yusupov/bayrieva-maryam-jangabaevna
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING (kevig)
This paper describes the use of Naive Bayes to assign function tags and of a context free
grammar (CFG) to parse Myanmar sentences. Part of the challenge of statistical function tagging for
Myanmar sentences comes from the fact that Myanmar has free phrase order and a complex
morphological system. Function tagging is a pre-processing step for parsing. For function tagging, we use a functionally annotated corpus and tag Myanmar sentences that carry correct segmentation, POS (part-of-speech) tagging and chunking information. We propose Myanmar grammar rules and apply the context free grammar (CFG) to derive the parse tree of function-tagged Myanmar sentences. Experiments
show that our analysis achieves good results in parsing simple sentences and three types of complex sentences.
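The idea of applying a CFG to a sequence of function tags can be sketched with a small CYK recognizer. Everything concrete here is an assumption made for illustration: the tag names `Subj`, `Obj` and `Verb`, the two binary rules in `RULES`, and the function name `cyk_recognize` are hypothetical and are not the grammar proposed in the paper.

```python
def cyk_recognize(tags, binary_rules, start="S"):
    """Return True if the tag sequence can be derived from `start`.

    `binary_rules` is a list of (lhs, (left, right)) pairs, i.e. a grammar
    in Chomsky normal form whose terminals are the function tags themselves.
    """
    n = len(tags)
    if n == 0:
        return False
    # table[i][j] holds every symbol that spans tags[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tag in enumerate(tags):
        table[i][i].add(tag)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between left and right part
                for lhs, (left, right) in binary_rules:
                    if left in table[i][k] and right in table[k + 1][j]:
                        table[i][j].add(lhs)
    return start in table[0][n - 1]


# Hypothetical toy grammar for a verb-final clause (illustration only).
RULES = [
    ("S", ("Subj", "VP")),
    ("VP", ("Obj", "Verb")),
]

if __name__ == "__main__":
    print(cyk_recognize(["Subj", "Obj", "Verb"], RULES))
    print(cyk_recognize(["Obj", "Subj", "Verb"], RULES))
```

With only these two rules the second sequence is rejected. A grammar for a language with free phrase order would need additional rules to accept such reorderings.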
We all do our research and put an effort in making a clear and an accurate presentation, but I'd be glad if this could help especially for those who are taking major in English and the like. Good luck!
A proper credit would be appreciated.
• Jay-ar A. Padernal, BSEd Major in English, University of Mindanao
This research describes an attempt to establish a pedagogically useful list of the most frequent semantically non-compositional multi-word combinations for English for Journalism learners in an EFL context, who need to read English news in their field of study. The list was compiled from the NOW (News on the Web) Corpus, the largest English news database by far. In consideration of opaque multi-word combinations in widespread use and pedagogical value, the researcher applied a set of selection criteria when using the corpus. Based on frequency, meaningfulness, and semantic non-compositionality, a total of 318 non-compositional multi-word combinations of 2 to 5 words with the exclusion of phrasal verbs were selected and they accounted for approximately 2% of the total words in the corpus. The list, not highly technical in nature, contains the most commonly-used multi-word units traversing various topic areas and news readers may encounter these phrasal expressions very often. As with other individual word lists, it is hoped that this opaque expressions list may serve as a reference for English for Journalism teaching.
STATISTICAL FUNCTION TAGGING AND GRAMMATICAL RELATIONS OF MYANMAR SENTENCES (cscpconf)
This paper describes context free grammar (CFG) based grammatical relations for Myanmar
sentences, combined with a corpus-based function tagging system. Part of the challenge of
statistical function tagging for Myanmar sentences comes from the fact that Myanmar has free phrase order
and a complex morphological system. Function tagging is a pre-processing step for
showing the grammatical relations of Myanmar sentences. In function tagging, which tags
the functions in Myanmar sentences that carry correct segmentation, POS (part-of-speech) tagging
and chunking information, we use Naive Bayesian theory to disambiguate the possible function
tags of a word. We apply a context free grammar (CFG) to find the grammatical relations of
the function tags. We also create a functionally annotated corpus for Myanmar and propose grammar rules for Myanmar sentences. Experiments show that our analysis achieves good results on both simple and complex sentences.
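Disambiguating function tags with Naive Bayes can be sketched as follows. This is a minimal illustration under stated assumptions: the feature strings (`pos=N`, `marker=subj`), the tag labels and the add-one smoothing are all hypothetical choices for the example, and the abstract does not specify the features or smoothing actually used.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """samples: list of (features, tag), where features is a tuple of strings."""
    tag_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for feats, tag in samples:
        tag_counts[tag] += 1
        for f in feats:
            feat_counts[tag][f] += 1
            vocab.add(f)
    return tag_counts, feat_counts, vocab


def predict(model, feats):
    """Return the tag with the highest log posterior (add-one smoothing)."""
    tag_counts, feat_counts, vocab = model
    total = sum(tag_counts.values())
    best_tag, best_score = None, float("-inf")
    for tag, n in tag_counts.items():
        score = math.log(n / total)  # log prior
        denom = sum(feat_counts[tag].values()) + len(vocab)
        for f in feats:
            score += math.log((feat_counts[tag][f] + 1) / denom)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag


# Invented toy training data (not from any Myanmar corpus).
samples = [
    (("pos=N", "marker=subj"), "Subj"),
    (("pos=N", "marker=obj"), "Obj"),
    (("pos=PRON", "marker=subj"), "Subj"),
    (("pos=N", "marker=obj"), "Obj"),
]
model = train(samples)

if __name__ == "__main__":
    print(predict(model, ("pos=N", "marker=subj")))
```

Working in log space avoids numeric underflow when many features are multiplied together.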
A survey on phrase structure learning methods for text classification (ijnlc)
Text classification is a task of automatic classification of text into one of the predefined categories. The
problem of text classification has been widely studied in different communities like natural language
processing, data mining and information retrieval. Text classification is an important constituent in many
information management tasks like topic identification, spam filtering, email routing, language
identification, genre classification, readability assessment etc. The performance of text classification
improves notably when phrase patterns are used. The use of phrase patterns helps in capturing non-local
behaviours and thus helps in the improvement of text classification task. Phrase structure extraction is the
first step toward phrase pattern identification. In this survey, a detailed study of phrase structure
learning methods has been carried out. This will enable future work in several NLP tasks that use
syntactic information from phrase structure, such as grammar checking, question answering, information
extraction, machine translation and text classification. The paper also provides different levels of classification
and detailed comparison of the phrase structure learning methods.
The Application of Distributed Morphology to the Lithuanian First Accent Noun... (Adrian Lin)
In this paper I apply the theoretical linguistics framework of distributed morphology to analyse the Lithuanian first accent noun declension paradigm. Lithuanian is an Indo-European language that retains numerous archaic features lost in other Indo-European languages. Its nouns have 12 declensions and, like other aspects of its grammar, are very complex. As such, this topic is well suited to a distributed morphology analysis.
Analysis of an image spam in email based on content analysis (ijnlc)
Researchers initially addressed the problem of spam detection as a text classification or
categorization problem. However, as spammers continue to develop new techniques and the types of email
content become more disparate, text-based anti-spam approaches alone are not sufficient for
preventing spam. In an attempt to defeat anti-spam technologies, spammers have recently
adopted the image spam trick to make scrutiny of an email's body text ineffective. The main idea behind
this project is to design a spam detection system. The system will be able to analyze the content of
emails, in particular an artificially generated image sent as an attachment. The system will
analyze the image content, classify the embedded image as spam or legitimate, and classify the email
accordingly.
Annotation for query result records based on domain specific ontology (ijnlc)
The World Wide Web holds a large collection of data, scattered across deep web databases and web
pages in unstructured or semi-structured formats. Recently evolved customer-friendly web applications
need special data extraction mechanisms to draw the required data out of the deep web according to
the end user's query and to populate the output page dynamically and as fast as possible. In existing research,
web data extraction methods are based on supervised learning (wrapper induction). In
the past few years, researchers have focused on automatic web data extraction methods based on similarity
measures. Among automatic data extraction methods, our existing Combining Tag and Value similarity
method fails to identify an attribute in the query result table. We propose a novel approach for data extraction and
label assignment called Annotation for Query Result Records based on domain specific ontology. First, a
domain ontology is constructed using information from the query interface and query result pages
obtained from the web. Next, using this domain ontology, a meaningful label is assigned automatically to each
column of the extracted query result records.
Inter rater agreement study on readability assessment in Bengali (ijnlc)
An inter-rater agreement study was performed for readability assessment in Bengali. A 1-7 rating scale was
used to indicate different levels of readability. We obtained moderate to fair agreement among seven
independent annotators on 30 text passages written by four eminent Bengali authors. As a by-product of
our study, we obtained a readability-annotated ground truth dataset in Bengali.
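Agreement among several annotators on a categorical scale is often summarized with a chance-corrected statistic. The abstract does not say which statistic was used; Fleiss' kappa is assumed here purely as an illustration, and the counts in `toy_counts` are invented, not the study's data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every row must sum to the same number of raters n (n >= 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Proportion of all assignments that fall in each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_item) / n_items
    p_exp = sum(p * p for p in p_cat)  # agreement expected by chance
    return (p_bar - p_exp) / (1 - p_exp)


# Invented example: 3 passages, 7 raters, 3 readability levels.
toy_counts = [
    [5, 2, 0],
    [1, 5, 1],
    [0, 2, 5],
]

if __name__ == "__main__":
    print(round(fleiss_kappa(toy_counts), 3))
```

The function is undefined when every rating falls in a single category, because expected agreement is then 1 and the denominator is zero.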
English Kazakh parallel corpus for statistical machine translation (ijnlc)
This paper presents problems and solutions in developing an English-Kazakh parallel corpus at the School of
Mechanics and Mathematics of the al-Farabi Kazakh National University. The research project included
constructing a 1,000,000-word English-Kazakh parallel corpus of legal texts, developing an English-Kazakh
translation memory of legal texts from the corpus and building a statistical machine translation
system. The project aims at collecting more than ten million words. The paper further elaborates on the
procedures followed to construct the corpus and develop the other products of the research project.
The methods used for collecting data and the results are discussed, and the errors encountered while collecting data,
together with how to handle them, are described.
NLization of Nouns, Pronouns and Prepositions in Punjabi With EUGENE (ijnlc)
Universal Networking Language (UNL) has been used by various researchers as an interlingua approach
to AMT (automatic machine translation). The UNL system consists of two main components/tools,
namely, EnConverter-IAN (used for converting text from a source language to UNL) and
DeConverter-EUGENE (used for converting text from UNL to a target language). This paper
highlights the DeConversion generation rules used by the DeConverter and shows their use in the
generation of Punjabi sentences. This paper also covers the results of running UNL input through
DeConverter-EUGENE and its evaluation on UNL sentences involving nouns, pronouns and
prepositions.
Flavius Claudius Julian's rhetorical speeches stylistic and computational app... (ijnlc)
The purpose of this study is to examine the rhetorical and political speeches of the emperor Julian using
computational tools. To this end, we apply corpus linguistics techniques for the
automatic extraction of word lists, collocation lists and lexical bundles from Julian’s speeches; using these
techniques we draw conclusions about his style and character.
RULE BASED TRANSLITERATION SCHEME FOR ENGLISH TO PUNJABI (ijnlc)
Machine transliteration has emerged as a very important research area in the field of
machine translation. Transliteration basically aims to preserve the phonological structure of words. Proper
transliteration of named entities plays a very significant role in improving the quality of machine translation.
In this paper we perform machine transliteration for the English-Punjabi language pair using a rule based
approach. We have constructed some rules for syllabification. Syllabification is the process of extracting or
separating the syllables from words. We calculate probabilities for named entities (proper
names and locations). For words that do not fall under the category of named entities, separate
probabilities are calculated using relative frequency through a statistical machine translation
toolkit known as MOSES. Using these probabilities we transliterate our input text from English to
Punjabi.
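Estimating probabilities by relative frequency, as mentioned in the abstract above, amounts to dividing the count of each aligned source-target pair by the count of its source unit. The sketch below shows only that arithmetic; it does not use or imitate the MOSES toolkit, and the aligned pairs use invented placeholder labels (`X1`, `X2`, `Y1`) rather than real Punjabi syllables.

```python
from collections import Counter, defaultdict


def relative_frequencies(pairs):
    """Estimate P(target | source) = count(source, target) / count(source)."""
    pair_counts = Counter(pairs)
    source_counts = Counter(src for src, _ in pairs)
    probs = defaultdict(dict)
    for (src, tgt), count in pair_counts.items():
        probs[src][tgt] = count / source_counts[src]
    return probs


# Invented aligned (source syllable, target unit) pairs for illustration.
aligned = [("ra", "X1"), ("ra", "X1"), ("ra", "X2"), ("vi", "Y1")]
probs = relative_frequencies(aligned)

if __name__ == "__main__":
    # Pick the most probable target for a source syllable.
    best = max(probs["ra"], key=probs["ra"].get)
    print(best, probs["ra"][best])
```

For each source unit the estimated probabilities sum to one, so the most frequent alignment is always the most probable candidate.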
An important problem in Thai word segmentation is the sentential noun phrase. Existing
studies try to minimize the problem, but no research solves it directly. This study
investigates an approach to this problem using conditional random fields, a probabilistic
model for segmenting and labelling sequence data. The results show that more than 78.61% of the
noun phrase data was detected correctly with our technique.
In recent years, great advances have been made in the speed, accuracy, and coverage of automatic word
sense disambiguation systems that, given a word appearing in a certain context, can identify the sense of
that word. In this paper we consider the problem of deciding whether the same words contained in different
documents have the same meaning or are homonyms. Our goal is to improve the estimate of the
similarity of documents in which some words may be used with different meanings. We present three new
strategies for solving this problem, which are used to filter out homonyms from the similarity computation.
Two of them are intrinsically non-semantic, whereas the other one has a semantic flavor and can also be
applied to word sense disambiguation. The three strategies have been embedded in an article
recommendation system that one of the most important Italian ad-serving companies offers to its customers.
OPTIMIZE THE LEARNING RATE OF NEURAL ARCHITECTURE IN MYANMAR STEMMER (kevig)
Morphological stemming is a critical step in natural language processing. Stemming
reduces alternative word forms to a common morphological root. Word segmentation for
the Myanmar language, as for most Asian languages, is an important task and an extensively studied
sequence labelling problem. Named entity detection is one of the issues in Asian languages that has
traditionally required a large amount of feature engineering to achieve high performance. The new
approach integrates these tasks, which would benefit all of them. In recent years, end-to-end
sequence labelling models with deep learning have been widely used. This paper introduces a deep BiGRUCNN-CRF network that jointly learns the word segmentation, stemming and named entity recognition tasks.
We trained the model using manually annotated corpora. State-of-the-art named entity recognition
systems rely heavily on handcrafted features; in our new approach, we introduce a joint model that
relies on two sources of information: character level representation and syllable level representation.
Ontology Matching Based on hypernym, hyponym, holonym, and meronym Sets in Wo... (dannyijwest)
Considerable research has been performed in the field of ontology matching, where information sharing
and reuse become necessary in ontology development. Lexical similarity in ontology
matching is measured using synsets, as defined in WordNet. In this paper, we define a Super Word Set,
which is an aggregate set that includes the hypernym, hyponym, holonym, and meronym sets in WordNet.
The Super Word Set Similarity is calculated as the rate at which the words of a concept name and its synset's words
are included in the Super Word Set. To measure Super Word Set Similarity, we first extracted
Matched Concepts (MC), Matched Properties (MP) and Property Unmatched Concepts (PUC) from the
result of ontology matching. We compared these against two ontology matching tools, COMA++ and
LOM. The Super Word Set Similarity shows an average improvement of 12% over COMA++ and 19%
over LOM.
Language Combinatorics: A Sentence Pattern Extraction Architecture Based on C...Waqas Tariq
A \"sentence pattern\" in modern Natural Language Processing is often considered as a subsequent string of words (n-grams). However, in many branches of linguistics, like Pragmatics or Corpus Linguistics, it has been noticed that simple n-gram patterns are not sufficient to reveal the whole sophistication of grammar patterns. We present a language independent architecture for extracting from sentences more sophisticated patterns than n-grams. In this architecture a \"sentence pattern\" is considered as n-element ordered combination of sentence elements. Experiments showed that the method extracts significantly more frequent patterns than the usual n-gram approach.
THE DIFFERENCES BETWEEN SYNTAX AND SEMANTICSHENOK SHIHEPO
Syntax and Semantics apply to several different fields such as Linguistics, Computer science and in the Philosophy of Languages. This essay will deliberate on the differences and some commonalities to the meanings of these terms, and their relationship as well.
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESkevig
Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks. Most of the natural language processing models that are based on deep learning techniques use already pre-trained distributed word representations, commonly called word embeddings. Determining the most qualitative word embeddings is of crucial importance for such models. However, selecting the appropriate word embeddings is a perplexing task since the projected embedding space is not intuitive to humans. In this paper, we explore different approaches for creating distributed word representations. We perform an intrinsic evaluation of several state-of-the-art word embedding methods. Their performance on capturing word similarities is analysed with existing benchmark datasets for word pairs similarities. The research in this paper conducts a correlation analysis between ground truth word similarities and similarities obtained by different word embedding methods.
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESkevig
Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks. Most of the natural language processing models that are based on deep learning techniques use already pre-trained distributed word representations, commonly called word embeddings. Determining the most qualitative word embeddings is of crucial importance for such models. However, selecting the appropriate word embeddings is a perplexing task since the projected embedding space is not intuitive to humans.In this paper, we explore different approaches for creating distributed word representations. We perform an intrinsic evaluation of several state-of-the-art word embedding methods. Their performance on capturing word similarities is analysed with existing benchmark datasets for word pairs similarities. The research in this paper conducts a correlation analysis between ground truth word similarities and similarities obtained by different word embedding methods.
A comparative analysis of particle swarm optimization and k means algorithm f...ijnlc
The volume of digitized text documents on the web have been increasing rapidly. As there is huge collection
of data on the web there is a need for grouping(clustering) the documents into clusters for speedy
information retrieval. Clustering of documents is collection of documents into groups such that the
documents within each group are similar to each other and not to documents of other groups. Quality of
clustering result depends greatly on the representation of text and the clustering algorithm. This paper
presents a comparative analysis of three algorithms namely K-means, Particle swarm Optimization (PSO)
and hybrid PSO+K-means algorithm for clustering of text documents using WordNet. The common way of
representing a text document is bag of terms. The bag of terms representation is often unsatisfactory as it
does not exploit the semantics. In this paper, texts are represented in terms of synsets corresponding to a
word. Bag of terms data representation of text is thus enriched with synonyms from WordNet. K-means,
Particle Swarm Optimization (PSO) and hybrid PSO+K-means algorithms are applied for clustering of
text in Nepali language. Experimental evaluation is performed by using intra cluster similarity and inter
cluster similarity.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofsAlex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Full-RAG: A modern architecture for hyper-personalizationZilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Epistemic Interaction - tuning interfaces to provide information for AI support
A COMPUTATIONAL APPROACH FOR ANALYZING INTER-SENTENTIAL ANAPHORIC PRONOUNS IN VIETNAMESE PARAGRAPHS
International Journal on Natural Language Computing (IJNLC) Vol. 2, No. 3, June 2013
DOI: 10.5121/ijnlc.2013.2303
Trung Tran (1) and Dang Tuan Nguyen (2)
(1) Faculty of Computer Science, University of Information Technology, Vietnam National University - Ho Chi Minh City, Vietnam
ttrung@nlke-group.net
(2) Faculty of Computer Science, University of Information Technology, Vietnam National University - Ho Chi Minh City, Vietnam
dangnt@uit.edu.vn
ABSTRACT
This paper presents a strategy and a computational model for resolving inter-sentential anaphoric pronouns
in Vietnamese paragraphs composed of simple sentences. The strategy is based on grammatical
features of nouns and on the focus phenomenon in the use of pronouns in Vietnamese. In this research, we
consider only nouns and pronouns that denote human objects in the paragraph; each anaphoric
pronoun appears once in a sentence and can appear in adjacent sentences. The computational
model is implemented in Prolog and applies and improves the models of Mark Johnson and
Ewan Klein, as improved by Covington and Schmitz, with Discourse Representation Theory
as its theoretical background. Analysis of test results shows that this approach, which is based on
linguistic theories, resolves inter-sentential anaphoric pronouns in Vietnamese paragraphs well.
KEYWORDS
Inter-sentential Anaphora, Anaphora Resolution, Discourse Representation
1. INTRODUCTION
Resolving inter-sentential anaphoric pronouns in a Vietnamese paragraph is an important research
topic in natural language processing, especially in text comprehension. Many authors
have proposed approaches, with strategies and models, for finding the exact antecedents of
anaphoric pronouns in paragraphs. Mark Johnson and Ewan Klein [7],
Covington and Schmitz [9], and Blackburn and Bos [12] proposed models based on Discourse
Representation Theory [4] and on number and gender constraints on English pronouns to
find the antecedents of anaphoric pronouns. A different approach, by Tyne Liang and
Dian-Song Wu [14], uses the WordNet ontology to identify animate entities and their gender.
Their system also relies on the gender of objects and on the distinction
between animate and non-animate objects in English, and proposes heuristic rules to resolve
anaphoric pronouns. Other research, such as that of Michel Denber [11], proposed a solution
based on the number and gender of objects in English, with additional
constraints on animate and non-animate objects and on the syntax of words in sentences. Another theory
widely used as the basis for resolving anaphoric pronouns is
Centering Theory, developed by Barbara J. Grosz, Aravind K. Joshi and Scott Weinstein [2] in
the 1980s.
In this paper, we present a strategy and a computational model for resolving inter-sentential
anaphoric pronouns in Vietnamese paragraphs composed of simple sentences. The strategy is
based on grammatical features of nouns and pronouns in Vietnamese and on the focus
phenomenon in the use of pronouns in Vietnamese. The model is designed in accordance with the
strategy, is based on Discourse Representation Theory [4], and consists of four main components: the
component for analysing the syntactic structure of the paragraph and its sentences with the top-down
method, described in Unification-Based Grammar (UBG) [8], [13]; the component for
describing lexical feature structures in Unification-Based Grammar [8], [13]; the
component for building the Discourse Representation Structure; and the component for finding
antecedents with algorithms based on the strategy. To implement the model in Prolog, we apply the model of
Mark Johnson and Ewan Klein [7] and Covington and Schmitz [9], with some changes
in accordance with the strategy for resolving inter-sentential anaphoric pronouns in Vietnamese
paragraphs, as follows:
• Only inter-sentential anaphoric pronouns are resolved, and the antecedent appears in a
sentence preceding the sentence containing the pronoun.
• The paragraph is not analysed into sentences recursively; instead, the position of each
sentence is determined.
• The characteristics of lexical items are described according to Vietnamese grammar.
• The algorithm for finding the antecedent of an inter-sentential anaphoric pronoun is based on
the strategy.
In this research, we limit consideration to the following forms of paragraph:
Form 1: The paragraph having only one anaphoric pronoun:
Example 1: “Nhân học môn vẽ. Anh dùng bút chì. Nghĩa hỏi anh.”
(English: “Nhân learns painting. He uses pencil. Nghĩa asks him.”)
[anh = Nhân]
Form 2: The paragraph having two anaphoric pronouns which appear in different sentences:
Example 2: “Lễ đọc sách. Anh thấy Chí. Anh ta đọc báo.”
(English: “Lễ reads book. He sees Chí. He reads newspaper.”)
[anh = Lễ, anh ta = Chí]
Form 3: The paragraph having two anaphoric pronouns which appear in the same sentence:
Example 3: “Lan học môn toán. Chị hỏi Mai. Chị ấy giúp chị.”
(English: “Lan learns maths. She asks Mai. She helps her.”)
[chị = Lan, chị ấy = Mai]
2. BACKGROUND
2.1. Discourse Representation Theory
Discourse Representation Theory was introduced in [4] with the following basic idea:
a natural language discourse is represented by a structure called a
Discourse Representation Structure (DRS). According to [4], a Discourse
Representation Structure is an ordered pair <U, Con>, where U is a list of discourse
markers, which can be interpreted as the objects of the discourse, and Con is a list of conditions,
which can be interpreted as predicates or formulas that the objects in U have to satisfy.
Example 4: Consider the paragraph having three sentences:
“Nhân học môn vẽ. Anh dùng bút chì. Nghĩa hỏi anh.”
(English: “Nhân learns painting. He uses pencil. Nghĩa asks him.”)
This paragraph will have the following Discourse Representation Structure:
• Objects in set U: X1 – Nhân, X2 – môn vẽ, X3 – bút chì, X4 – Nghĩa.
• Conditions in set Con: tên(X1, [Nhân]), môn_vẽ(X2), học(X1, X2), bút_chì(X3), dùng(X1, X3),
tên(X4, [Nghĩa]), hỏi(X4, X1).
This structure is represented in Table 1:
Table 1. The Discourse Representation Structure of the paragraph “Nhân học môn vẽ. Anh dùng bút chì. Nghĩa hỏi anh”.
U   | X1, X2, X3, X4
Con | tên(X1, [Nhân])
    | môn_vẽ(X2)
    | học(X1, X2)
    | bút_chì(X3)
    | dùng(X1, X3)
    | tên(X4, [Nghĩa])
    | hỏi(X4, X1)
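The ordered pair <U, Con> can be made concrete with a small data structure. The following Python sketch is only an illustration and not the authors' Prolog implementation: the class name DRS, its methods and the tuple encoding of conditions are hypothetical choices made here. It rebuilds the structure of Table 1, with the pronoun "anh" already resolved to X1.

```python
from dataclasses import dataclass, field


@dataclass
class DRS:
    """A Discourse Representation Structure <U, Con> (illustrative only)."""
    universe: list = field(default_factory=list)    # U: discourse markers
    conditions: list = field(default_factory=list)  # Con: conditions over U

    def new_marker(self):
        """Introduce a fresh discourse marker X1, X2, ... into U."""
        marker = f"X{len(self.universe) + 1}"
        self.universe.append(marker)
        return marker

    def add(self, predicate, *args):
        """Record a condition as a tuple, e.g. ('học', 'X1', 'X2')."""
        self.conditions.append((predicate, *args))


def build_table_1():
    """Rebuild the DRS of 'Nhân học môn vẽ. Anh dùng bút chì. Nghĩa hỏi anh.'"""
    drs = DRS()
    x1 = drs.new_marker()
    drs.add("tên", x1, "Nhân")
    x2 = drs.new_marker()
    drs.add("môn_vẽ", x2)
    drs.add("học", x1, x2)
    x3 = drs.new_marker()
    drs.add("bút_chì", x3)
    drs.add("dùng", x1, x3)      # "anh" is resolved to X1 (Nhân)
    x4 = drs.new_marker()
    drs.add("tên", x4, "Nghĩa")
    drs.add("hỏi", x4, x1)       # "anh" is again resolved to X1
    return drs
```

Resolving a pronoun in this representation amounts to reusing an existing marker (here X1) instead of introducing a new one.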
2.2. Unification-Based Grammar
In [8], [13], the authors introduced unification-based formalisms and Unification-Based Grammar
with the following basic idea: Unification-Based Grammar is a formalism in which theories of grammar can
be expressed, and in which the unification of feature structures plays the prominent role. In the analysis of the
syntactic structure of sentences, each constituent or lexical item can carry an additional
feature structure that describes its characteristics.
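The idea of unifying feature structures can be sketched as follows. This is a deliberately simplified Python illustration, assuming feature structures are nested dictionaries with atomic values; it has no variables or shared substructures, which a full unification-based formalism would support. The feature names (syn, cat, human, role) are hypothetical and chosen only for the example.

```python
def unify(fs1, fs2):
    """Unify two feature structures given as nested dicts.

    Returns the merged structure, or None when the two structures
    assign clashing values to the same feature.
    """
    result = dict(fs1)
    for feature, value in fs2.items():
        if feature not in result:
            # Feature only present in fs2: simply add it.
            result[feature] = value
        elif isinstance(result[feature], dict) and isinstance(value, dict):
            # Both sides are complex: unify them recursively.
            merged = unify(result[feature], value)
            if merged is None:
                return None
            result[feature] = merged
        elif result[feature] != value:
            # Atomic clash (or atom against complex value): failure.
            return None
    return result


noun = {"syn": {"cat": "noun", "human": True}}
subject_slot = {"syn": {"human": True, "role": "subject"}}
non_human_slot = {"syn": {"human": False}}
```

Here unify(noun, subject_slot) succeeds and adds the role feature to the noun's description, while unify(noun, non_human_slot) fails because the two structures disagree on human.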
3. THE STRATEGY FOR SOLVING INTER-SENTENTIAL ANAPHORIC
PRONOUNS IN VIETNAMESE PARAGRAPH
In this section, we present the strategy for resolving inter-sentential anaphoric pronouns in
Vietnamese paragraphs composed of simple sentences. This strategy is based on grammatical
features of nouns and pronouns in Vietnamese as well as on the focus phenomenon in the use of
pronouns in Vietnamese paragraphs.
In Vietnamese, nouns and pronouns distinguish only between humans, animals and non-animate
objects; they do not distinguish gender. Although some pronouns, such as “anh” or “cô”,
distinguish male from female, others, such as “em” and “nó”, do not specify
gender. Therefore, unlike [7], [9], in this research we do not use gender.
Instead, we rely on the grammatical characteristic of nouns that distinguishes humans from animals
and non-animate objects, and on the role of a noun as the subject or object of the verb in the
sentence, to find the antecedent of an inter-sentential anaphoric pronoun. The main idea of using
these two features is: “depending on whether the anaphoric pronoun stands alone or stands with “ấy” / “ta” /
“này”, focus on nouns that indicate humans and take the subject or object role of the verb in
preceding sentences”. These two constraints make it possible to resolve inter-sentential anaphoric pronouns in
paragraphs of more than two sentences, in contrast to the model of [7], [9]. The
paragraphs considered in this research, in the forms presented in the Introduction, have the
following characteristics:
• The number of sentences is not fixed, but ranges from 3 to 5.
• All anaphoric pronouns refer to humans.
• Only one or two of the human objects in the paragraph are antecedents.
• Anaphoric pronouns and their antecedents can appear in sentences that are not adjacent.
• Each anaphoric pronoun appears once in a sentence and can appear in adjacent
sentences.
The strategy for resolving inter-sentential anaphoric pronouns in a Vietnamese paragraph is:
• Mark the position of each sentence in the paragraph.
• Find the exact antecedent in the sentences preceding the one that contains the anaphoric
pronoun.
• If the anaphoric pronoun stands alone: find the antecedent that is a focused human object and
takes the subject role of the verb in preceding sentences.
• If the anaphoric pronoun is followed by “ta” / “ấy” / “này”: find the antecedent that is a focused
human object and takes the object role of the verb in preceding sentences.
The strategy is illustrated with the following example from the Introduction:
Example 5: “Lễ đọc sách. Anh thấy Chí. Anh ta đọc báo.”
(English: “Lễ reads book. He sees Chí. He reads newspaper.”)
The pronoun “anh” stands alone in the second sentence, so it focuses on the proper noun “Lễ”,
which indicates a human object and takes the subject role of the verb in the first sentence.
Therefore, the antecedent of the anaphoric pronoun “anh” is the object “Lễ”.
The pronoun “anh” plus “ta” in the third sentence focuses on the proper noun “Chí”, which
indicates a human object and takes the object role of the verb in the second sentence. Therefore, the
antecedent of the anaphoric pronoun “anh ta” is the object “Chí”.
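The two focus rules can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' Prolog algorithm: sentences are annotated by hand with subject, verb and object instead of being parsed; preceding sentences are searched from nearest to farthest, which is an assumption made here since the strategy only says "preceding sentences"; and all names (Arg, Sentence, resolve, wanted_role) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

MARKERS = {"ta", "ấy", "này"}  # words that follow a pronoun in its marked form


@dataclass
class Arg:
    text: str
    human: bool = False
    pronoun: bool = False


@dataclass
class Sentence:
    subject: Arg
    verb: str
    obj: Optional[Arg] = None


def wanted_role(pronoun_text):
    """A bare pronoun seeks a subject; pronoun + ta/ấy/này seeks an object."""
    return "object" if pronoun_text.split()[-1] in MARKERS else "subject"


def resolve(paragraph):
    """Return {(sentence_number, pronoun_text): antecedent} for a list of Sentence."""
    history = []  # per earlier sentence: human referent of each role, or None
    links = {}
    for number, sentence in enumerate(paragraph, start=1):
        roles = {}
        for role, arg in (("subject", sentence.subject), ("object", sentence.obj)):
            if arg is None or not arg.human:
                roles[role] = None
            elif not arg.pronoun:
                roles[role] = arg.text
            else:
                needed = wanted_role(arg.text)
                # Assumption: nearest preceding sentence is tried first.
                roles[role] = next(
                    (prev[needed] for prev in reversed(history)
                     if prev[needed] is not None),
                    None,
                )
                links[(number, arg.text)] = roles[role]
        history.append(roles)
    return links


def human(name):
    return Arg(name, human=True)


def pron(text):
    return Arg(text, human=True, pronoun=True)


example_5 = [
    Sentence(human("Lễ"), "đọc", Arg("sách")),
    Sentence(pron("anh"), "thấy", human("Chí")),
    Sentence(pron("anh ta"), "đọc", Arg("báo")),
]
```

Calling resolve(example_5) links “anh” in the second sentence to Lễ and “anh ta” in the third sentence to Chí, as in Example 5. Because a resolved pronoun passes its referent on to later sentences, the same function also handles a bare pronoun whose nearest preceding subject is itself a pronoun.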
4. THE SYSTEM MODEL
In this section, we present the system model and the algorithm for finding antecedents of inter-sentential
anaphoric pronouns in accordance with the strategy proposed above. The system model is
designed on the basis of Discourse Representation Theory [4], applies and improves the model of [7],
[9], and consists of four main components: the component for analysing the syntactic structure of the
paragraph and its sentences with the top-down method, described in Unification-Based
Grammar [8], [13]; the component for describing lexical feature structures in Unification-Based
Grammar [8], [13]; the component for building the Discourse Representation Structure; and the
component for finding the antecedent with the algorithm based on the strategy. The model is
represented as follows:
5. International Journal on Natural Language Computing (IJNLC) Vol. 2, No.3, June 2013
Figure 1. The general model for analyzing inter-sentential anaphoric pronoun.
The components of the system are described in more detail as follows:
4.1. Analysing the Syntactic Structures of Paragraphs and Sentences
This component analyzes the syntactic structure of the paragraph into sentences. Unlike the
model of [7], [9], this component does not use a recursive analysis; instead, it separates the
paragraph explicitly into sentences and indexes the position of each sentence, so that the order
of the sentences in the paragraph can be distinguished. The analysis is performed by top-down
rules as follows:
discourse --> statement_first, endpunct,
              statement_second, endpunct,
              statement_third, endpunct.
discourse --> [].
endpunct --> ['.'].
Figure 2. Analyze the syntactic structure of paragraph into sentences using top-down method.
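The rules in Figure 2 can be tried out in standard Prolog. The following is a minimal sketch, assuming that the paragraph has already been tokenized into a list of atoms in which each full stop is a separate token. The names sentence//1, words//1 and split_paragraph/2, and the Position-Words pairs, are illustrative and do not appear in the original system.

```prolog
% Sketch only: split a tokenized three-sentence paragraph and label each
% sentence with its position. All predicate names here are illustrative.

% discourse(-Sentences): exactly three sentences, each closed by '.',
% or an empty paragraph.
discourse([first-S1, second-S2, third-S3]) -->
    sentence(S1), endpunct,
    sentence(S2), endpunct,
    sentence(S3), endpunct.
discourse([]) --> [].

endpunct --> ['.'].

% sentence(-Words): one or more tokens, none of which is the full stop.
sentence([W|Ws]) --> [W], { W \== '.' }, words(Ws).

words([W|Ws]) --> [W], { W \== '.' }, words(Ws).
words([])     --> [].

% split_paragraph(+Tokens, -Labelled): succeeds once for a well-formed paragraph.
split_paragraph(Tokens, Labelled) :-
    once(phrase(discourse(Labelled), Tokens)).
```

For the tokens of Example 5, split_paragraph/2 pairs the three word lists with first, second and third; a paragraph with only two sentences is rejected.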
In this analysis, a paragraph is separated into three sentences, and each sentence is indexed
with its position. In the description of this analysis by UBG in Prolog, based on the model of
[9], we define a flag flag_position that describes the position of each sentence in the
paragraph as a syntactic characteristic. The flag_position of each sentence takes the value
corresponding to the position of that sentence in the paragraph, as follows:
discourse(D) --> {
        S1 = syn~flag_position~[first],
        D  = sem~in~A,
        S1 = sem~in~A,
        S1 = sem~out~B,
        S2 = syn~flag_position~[second],
        S2 = sem~in~B,
        S2 = sem~out~C,
        S3 = syn~flag_position~[third],
        S3 = sem~in~C,
        S3 = sem~out~E,
        D  = sem~out~E
    },
    statement(S1),
    endpunct,
    statement(S2),
    endpunct,
    statement(S3),
    endpunct.
Figure 3. Analyze the syntactic structure of the paragraph into sentences in Prolog.
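The way Figure 3 chains the context of one sentence into the next (A to B to C to E) can also be written with ordinary DCG arguments. This is a minimal sketch, assuming plain Prolog instead of GULP feature structures; statement//3 is only a placeholder for the real sentence analysis, and the seen/2 terms it records are invented for illustration.

```prolog
% Sketch only: plain DCG arguments standing in for the GULP features
% syn~flag_position and sem~in / sem~out. Names are illustrative.

:- use_module(library(lists)).

% discourse(+In, -Out): the context flows In -> B -> C -> Out.
discourse(In, Out) -->
    statement(first,  In, B),   endpunct,
    statement(second, B,  C),   endpunct,
    statement(third,  C,  Out), endpunct.

endpunct --> ['.'].

% statement(+Position, +In, -Out): placeholder for real sentence analysis.
% It appends every word of the sentence, tagged with the sentence position,
% to the incoming context.
statement(Pos, In, Out) -->
    tokens(Ws),
    { Ws = [_|_],
      findall(seen(W, Pos), member(W, Ws), New),
      append(In, New, Out) }.

% tokens(-Words): the tokens up to, but not including, the next full stop.
tokens([W|Ws]) --> [W], { W \== '.' }, tokens(Ws).
tokens([])     --> [].
```

Because each statement receives the output context of the previous one, anything recorded for the first sentence is visible while the second and third are processed, which is what the antecedent search relies on.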
This component also analyzes the constituent structure of each sentence into smaller
constituents: noun phrases, verb phrases and lexical items. During the analysis, UBG allows data
to be passed up and down between constituents, so the position characteristic flag_position is
transferred to the smaller constituents to determine the position of each constituent in the
paragraph. In this research, we consider three types of simple sentence forming the paragraph,
with the following constituent structures:
• Noun phrase + Verb phrase
Example 6: “Lễ đọc sách.” (English: Lễ reads books.)
• Noun phrase + Adjective
Example 7: “Lễ hạnh phúc.” (English: Lễ is happy.)
• Noun phrase + “là” (is) + Noun phrase
Example 8: “Lễ là giám đốc.” (English: Lễ is a manager.)
The analysis of the constituent structure of each sentence into smaller constituents is
performed by top-down rules as follows:
s --> np, vp.
s --> np, adj.
s --> np, [là], np.
np --> n(class:proper).
np --> [anh]; [anh, ấy]; [cô]; [cô, ấy]; [chị]; [chị, ấy]; [ông];
      [ông, ấy]; [bà]; [bà, ấy]; [em]; [em, ấy]; [bạn]; [bạn, ấy].
np --> n(class:common).
vp --> v, np.
vp --> v.
Figure 4. Analyze the constituent structure of each sentence into smaller constituents using top-down rules.
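The rules of Figure 4 can be run as an ordinary DCG once a lexicon is supplied. This is a minimal sketch under stated assumptions: the toy lexicon, the parse-tree terms and the helper predicates (bare_pronoun/1, demonstrative/1 and so on) are illustrative, and multi-syllable words such as “giám đốc” and “hạnh phúc” are assumed to arrive as single tokens.

```prolog
% Sketch only: the rules of Figure 4 as a runnable DCG over a toy lexicon.
% The lexicon entries and the parse-tree terms are illustrative assumptions.

s(s(NP, VP))        --> np(NP), vp(VP).
s(s(NP, Adj))       --> np(NP), adj(Adj).
s(s(NP, 'là', NP2)) --> np(NP), ['là'], np(NP2).

np(np(proper(N)))  --> [N], { proper_noun(N) }.
np(np(pronoun(P))) --> pronoun(P).
np(np(common(N)))  --> [N], { common_noun(N) }.

% A pronoun followed by "ta" / "ấy" / "này" is tried before the bare pronoun.
pronoun([P, D]) --> [P, D], { bare_pronoun(P), demonstrative(D) }.
pronoun([P])    --> [P],    { bare_pronoun(P) }.

vp(vp(V, NP)) --> v(V), np(NP).
vp(vp(V))     --> v(V).

v(v(V))     --> [V], { verb(V) }.
adj(adj(A)) --> [A], { adjective(A) }.

% Toy lexicon covering Examples 5 to 8.
proper_noun('lễ').      proper_noun('chí').
common_noun('sách').    common_noun('báo').   common_noun('giám đốc').
bare_pronoun(anh).      bare_pronoun('cô').
demonstrative(ta).      demonstrative('ấy').  demonstrative('này').
verb('đọc').            verb('thấy').
adjective('hạnh phúc').
```

A query such as phrase(s(T), [anh, ta, 'đọc', 'báo']) returns a tree in which the pronoun is recorded together with “ta”, which is the distinction the finding strategy depends on.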
4.2. Describing the Syntactic and Semantic Characteristics of Words
After the constituent structure of each sentence has been analyzed down to the word level, this
component describes the syntactic and semantic characteristics of each word according to its
category. Because only the three types of simple sentence described above are considered, this
component describes only words that belong to three categories: noun, verb and adjective. In
accordance with the strategy for resolving inter-sentential anaphoric pronouns proposed above,
the description concentrates on the grammatical characteristics of nouns:
• the unique index characteristic, defined exclusively for each noun;
• the position characteristic, which takes the value transferred from the position
characteristic of the sentence;
• the characteristic indicating whether the noun denotes a human or a thing (animal or
non-animate object), defined for each noun;
• the characteristic indicating the subject or object role of the noun with respect to the verb
phrase, which takes the value transferred from the analysis of the sentence into noun phrase and
verb phrase, or from the analysis of the verb phrase into verb and noun phrase.
The description of these characteristics helps to determine the following points:
• Each object. Here, both proper nouns and common nouns are treated as objects.
• The syntactic and semantic characteristics of each noun in the paragraph, which are the
premise for building the Discourse Representation Structure and for determining which
noun is the antecedent of the inter-sentential anaphoric pronoun.
• The actions and properties of each object in the paragraph.
Table 2 presents the syntactic and semantic characteristics of each category:
Table 2. Characteristics of Word Categories.
Word category: Noun
Syntactic characteristics:
• The unique index i for each object.
• The position index, which coincides with the position index of the sentence and shows in
which sentence of the paragraph the object appears.
• The index that indicates human or thing (animal or non-animate object).
• The index that indicates the subject or object role of the verb.
• The index that distinguishes proper nouns from common nouns. Here, both proper nouns and
common nouns are treated as objects.
Semantic characteristics:
• The common meaning of the noun.
• The context of the DRS structure before and after considering the noun: the index of the
object is added.
Word category: Verb
Syntactic characteristics:
• The index that distinguishes transitive verbs from intransitive verbs.
• The index that denotes the arguments of the verb:
o The first argument shows the subject of the verb.
o The second argument shows the object of the verb.
o An intransitive verb has only the first argument. A transitive verb has two arguments.
Semantic characteristics:
• The common meaning of the verb.
• The context of the DRS structure before and after considering the verb.
Word category: Adjective
Syntactic characteristics:
• The index, which coincides with the index of the subject.
Semantic characteristics:
• The common meaning of the adjective.
• The context of the DRS structure before and after considering the adjective.
The characteristics of the noun category are illustrated with a proper noun and a common noun
from the paragraph in Example 4, “Nhân học môn vẽ. Anh dùng bút chì. Nghĩa hỏi anh.”, as follows:
Example 9: Consider the proper noun “Nhân” in the first sentence.
• Syntactic characteristics:
o The index index is generated uniquely for the object “Nhân”.
o The index flag_position takes the value [first], transferred from the position
index of the sentence during the analysis of the paragraph into consecutive
sentences; it indicates that the object is in the first sentence.
o The index flag_state takes the value [subject], transferred from the analysis
of the sentence into noun phrase and verb phrase; it indicates that the noun
“Nhân” is the subject of the verb “học”.
o The index flag_species takes the value [human]; it indicates that the object
“Nhân” is human.
o The index class takes the value [proper]; it indicates that this is a proper noun.
• Semantic characteristics:
o This is the name of the object “Nhân”.
o The context of the DRS structure after considering this proper noun gains the
index index of the object “Nhân”.
In Prolog, the characteristics of the proper noun “Nhân” above are described, based on the model
of [7], [9], as follows:
n(N) --> [nhân],
    {
        unique_integer(I),
        FSP = [human],
        N = syn~(index~I ..
                 flag_position~FP ..
                 flag_state~FST ..
                 flag_species~FSP ..
                 class~proper) ..
            sem~(in~DRSList ..
                 out~NewDRSList)
    }.
Figure 5. Describe characteristics of proper noun “Nhân” in Prolog
Example 10: Consider the common noun “bút chì” (English: “pencil”) in the second sentence.
• Syntactic characteristics:
o The index index is generated uniquely for the object “bút chì”.
o The index flag_position takes the value [second], transferred from the
position index of the sentence during the analysis of the paragraph into
consecutive sentences; it indicates that the object is in the second sentence.
o The index flag_state takes the value [object], transferred from the analysis
of the verb phrase into verb and noun phrase; it indicates that the noun
“bút chì” is the object of the verb “dùng”.
o The index flag_species takes the value [thing]; it indicates that the object
“bút chì” is a thing.
o The index class takes the value [common]; it indicates that this is a common noun.
• Semantic characteristics:
o The common meaning shows that the object is a “bút chì”.
o The context of the DRS structure after considering this common noun gains the
index index of the object “bút chì”.
In Prolog, the characteristics of the common noun “bút chì” above are described, based on the
model of [7], [9], as follows:
n(N) --> [bút,chì],
    {
        unique_integer(I),
        FSP = [thing],
        N = syn~(index~I ..
                 flag_position~FP ..
                 flag_state~FST ..
                 flag_species~FSP ..
                 class~common) ..
            sem~(in~[drs(U,Con)|Super] ..
                 out~[drs([I|U],NewCon)|Super])
    }.
Figure 6. Describe characteristics of common noun “bút chì” in Prolog
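For readers without GULP, a noun entry of this kind can be approximated in plain Prolog. This is a minimal sketch, assuming that the feature structure is replaced by an explicit noun/5 term, that the position and role are passed in as arguments, and that “bút chì” arrives as a single token; fresh_index/1 is a hypothetical stand-in for unique_integer/1, and the set Con is left unchanged here.

```prolog
% Sketch only: a noun entry in plain Prolog. GULP feature structures are
% replaced by an explicit noun/5 term; all names are illustrative.

:- dynamic counter/1.
counter(0).

% fresh_index(-I): next unused integer, standing in for unique_integer/1.
fresh_index(I) :-
    retract(counter(N)),
    I is N + 1,
    assertz(counter(I)).

% n(-Features, +Position, +Role, +ContextIn, -ContextOut)
% Position and Role are handed down by the sentence and phrase rules.
% The new index is pushed onto the universe U of the innermost DRS.
n(noun(I, Pos, Role, Species, Class), Pos, Role,
  [drs(U, Con)|Super], [drs([I|U], Con)|Super]) -->
    [W],
    { lexicon(W, Species, Class),
      fresh_index(I) }.

% lexicon(?Word, ?Species, ?Class): toy entries from Example 4.
lexicon('nhân',    human, proper).
lexicon('bút chì', thing, common).
```

Calling phrase(n(F, second, object, [drs([], [])], Out), ['bút chì']) binds F to a noun/5 term carrying a fresh index, the position second, the role object, the species thing and the class common, and Out to a context whose universe contains that index.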
4.3. Building the Discourse Representation Structure
After the syntactic and semantic characteristics of each word have been described, this
component uses them to build the Discourse Representation Structure (DRS) of the paragraph by
adding these characteristics to set U and set Con as appropriate. Building the DRS structure
helps to represent the meaning of the paragraph. The main idea is to determine each object
together with its actions and properties. Based on the grammatical characteristics of each noun,
this component adds the unique index to set U to identify the object, and then adds the other
characteristics of the noun to set Con to describe the object. The syntactic and semantic
characteristics of verbs and adjectives are added to set Con to describe the actions and
properties of the object. The building is performed sequentially, right after the grammatical
characteristics of each word have been described; when the next sentence is processed, it is
based on the context of the DRS structure built from the preceding sentences. Therefore, once the
exact antecedent of an inter-sentential anaphoric pronoun has been identified in a preceding
sentence, this component adds the actions and properties of the pronoun to set U and set Con and
associates them with this antecedent.
Table 3 presents the characteristics of the predicates that are added in the process of
building the DRS structure.
Table 3. Build Set U and Set Con of the DRS Structure.
Word category: Proper noun
Build set U:
• Add the index Index of the object.
Build set Con:
• Add the characteristic: the object name condition.
• Add the characteristic: the position condition of the object.
• Add the characteristic: the subject or object role condition of the verb of the object.
• Add the characteristic: indicate whether the object is human or thing.
Word category: Common noun
Build set U:
• Add the index Index of the object.
Build set Con:
• Add the characteristic: the meaning condition of the object.
• Add the characteristic: the position condition of the object.
• Add the characteristic: the subject or object role condition of the verb of the object.
• Add the characteristic: indicate whether the object is human or thing.
Word category: Transitive verb
Build set U: (none)
Build set Con:
• Add the characteristic: the meaning condition with two arguments:
o The first argument shows the subject of the verb.
o The second argument shows the object of the verb.
Word category: Intransitive verb
Build set U: (none)
Build set Con:
• Add the characteristic: the meaning condition with one argument:
o The only argument shows the subject of the verb.
Word category: Adjective
Build set U: (none)
Build set Con:
• Add the characteristic: the meaning condition with one argument:
o The only argument shows the subject of the adjective.
The building of the DRS structure is illustrated by adding the syntactic and semantic
characteristics of the nouns described above, as follows:
Consider the proper noun “Nhân”:
• Build set U:
o Add the index I of the object “Nhân”.
• Build set Con associated with index I:
o Add the characteristic: the object name condition: named(I,[nhân]).
o Add the characteristic: the position condition of the object: position(I,[first]).
o Add the characteristic: the subject role condition of the object: state(I,[subject]).
o Add the characteristic: the human characteristic: species(I,[human]).
Consider the common noun “bút chì” (English: “pencil”):
• Build set U:
o Add the index I of the object “bút chì”.
• Build set Con associated with index I:
o Add the characteristic: the meaning condition of the object: bút_chì(I).
o Add the characteristic: the position condition of the object: position(I,[second]).
o Add the characteristic: the object role condition of the object: state(I,[object]).
o Add the characteristic: the thing characteristic: species(I,[thing]).
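The two noun cases above can be sketched as a single predicate that extends a DRS. This is a minimal illustration, assuming that a DRS is written as drs(U, Con) and that a noun is passed in as a noun/6 term; add_noun/3, meaning_condition/4 and the noun/6 encoding are hypothetical names chosen for this example.

```prolog
% Sketch only: add one noun to a DRS drs(U, Con) as laid out in Table 3.
% The term noun(Index, Word, Position, Role, Species, Class) is an assumed
% encoding of the noun's characteristics, not the GULP structure.

:- use_module(library(lists)).

% add_noun(+Noun, +DrsIn, -DrsOut): push the index onto U and append the
% meaning, position, role and species conditions to Con.
add_noun(noun(I, Word, Pos, Role, Species, Class),
         drs(U, Con), drs([I|U], NewCon)) :-
    meaning_condition(Class, Word, I, Meaning),
    append(Con,
           [Meaning, position(I, [Pos]), state(I, [Role]), species(I, [Species])],
           NewCon).

% A proper noun contributes named(I, [Word]); a common noun contributes Word(I).
meaning_condition(proper, Word, I, named(I, [Word])).
meaning_condition(common, Word, I, Meaning) :-
    Meaning =.. [Word, I].
```

Adding “Nhân” and then “bút chì” to an empty DRS yields a universe holding both indexes and a Con list with the eight conditions listed above, in the order in which the nouns were processed.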
4.4. Finding the Antecedent of the Inter-sentential Anaphoric Pronoun
This component is implemented together with the component that builds the DRS structure. It
finds the antecedent of the inter-sentential anaphoric pronoun in accordance with the strategy
proposed above, based on the unique indexes and characteristic conditions of the objects in set U
and set Con of the DRS structure. The main idea is to consider each object that has a unique
index in set U and to check whether the characteristic conditions of this object in set Con agree
with the conditions of the strategy. Based on this idea, the algorithm for finding the antecedent
of an inter-sentential anaphoric pronoun is as follows:
• Consider the pronoun standing alone:
While (index I is in U)
    While (predicate associated with I is in Con)
        If ((position(I) < position(pronoun)) and
            (state(I) is [subject]) and
            (species(I) is [human]))
            Index of the antecedent = I
        End If
    End While
End While
Figure 7. The algorithm of finding antecedent for anaphoric pronoun standing alone
• Consider the pronoun standing with “ta” / “ấy” / “này”:
While (index I is in U)
    While (predicate associated with I is in Con)
        If ((position(I) < position(pronoun)) and
            (state(I) is [object]) and
            (species(I) is [human]))
            Index of the antecedent = I
        End If
    End While
End While
Figure 8. The algorithm of finding antecedent for anaphoric pronoun standing with “ta” / “ấy” / “này”
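The two searches in Figures 7 and 8 differ only in the role they look for, so they can be sketched as one Prolog predicate. This is a minimal sketch, assuming a DRS written as drs(U, Con) with the position/2, state/2 and species/2 conditions used in this paper; antecedent/4, target_role/2, position_rank/2 and the form labels alone and with_demonstrative are hypothetical names.

```prolog
% Sketch only: the two searches of Figures 7 and 8 as one predicate over a
% DRS drs(U, Con). Predicate and argument names are illustrative.

:- use_module(library(lists)).

% Sentence positions in paragraph order.
position_rank(first,  1).
position_rank(second, 2).
position_rank(third,  3).

% A pronoun standing alone looks for a subject; a pronoun standing with
% "ta" / "ấy" / "này" looks for an object.
target_role(alone,              subject).
target_role(with_demonstrative, object).

% antecedent(+Form, +PronounPosition, +DRS, -Index)
% Index is a human referent that has the target role and was introduced in
% a sentence before the one containing the pronoun.
antecedent(Form, PronPos, drs(U, Con), I) :-
    target_role(Form, Role),
    position_rank(PronPos, PronRank),
    member(I, U),
    member(position(I, [Pos]), Con),
    position_rank(Pos, Rank),
    Rank < PronRank,
    member(state(I, [Role]), Con),
    member(species(I, [human]), Con).
```

Unlike the loops in the figures, which keep the last matching index, this version returns every matching index on backtracking; when exactly one object satisfies the conditions the result is the same, and several answers signal the ambiguous case.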
With the two algorithms described above, and with the component that builds the DRS structure
continuing to run after the exact antecedent has been identified, the search has the following
prominent features:
• The objects considered are limited to those known at the time the inter-sentential anaphoric
pronoun is considered, because only objects that have a unique index in set U are examined.
• The antecedent found is located in a sentence that precedes the sentence containing the
inter-sentential anaphoric pronoun. This differs from the model of [7], [9], in which the
antecedent can be located in the same sentence as the pronoun, preceding it. The main idea
here is to index the position of each sentence in the paragraph, so that the position of
each object in the paragraph can be determined.
• The verb takes the antecedent found as its first or second argument, depending on whether
the antecedent fills the subject or the object role of that verb.
• The adjective takes the antecedent found as its argument.
The finding algorithm is applied to the paragraph in Example 5, “Lễ đọc sách. Anh thấy Chí. Anh
ta đọc báo.” (English: “Lễ reads a book. He sees Chí. He reads a newspaper.”), as follows:
• Consider the anaphoric pronoun “anh” standing alone in the second sentence:
o Set U at this point has two objects: X1 – Lễ, X2 – sách
o Set Con at this point has the predicates: named(X1,[lễ]), position(X1,[first]),
state(X1,[subject]), species(X1,[human]), sách(X2), position(X2,[first]),
state(X2,[object]), species(X2,[thing])
o Consider object X1:
The position of X1 (first) < the position of “anh” (second).
The role of X1 is subject of the verb.
X1 is a human object.
o As a result, the antecedent found for the pronoun “anh” standing alone has the index
X1.
o Set Con adds the predicate: đọc(X1,X2)
• Consider the anaphoric pronoun “anh” standing with “ta” in the third sentence:
o Set U at this point has an additional object: X3 – Chí
o Set Con at this point has the additional predicates: named(X3,[chí]),
position(X3,[second]), state(X3,[object]), species(X3,[human]), thấy(X1,X3)
o Consider object X3:
The position of X3 (second) < the position of “anh ta” (third).
The role of X3 is object of the verb.
X3 is a human object.
o As a result, the antecedent found for the pronoun “anh” standing with “ta” has the
index X3.
The component that builds the DRS structure then continues and finally produces the whole
DRS structure of the paragraph “Lễ đọc sách. Anh thấy Chí. Anh ta đọc báo.” as follows:
Table 4. The DRS structure of the paragraph “Lễ đọc sách. Anh thấy Chí. Anh ta đọc báo.”
X1, X2, X3, X4
named(X1,[lễ])
position(X1,[first])
state(X1,[subject])
species(X1,[human])
sách(X2)
position(X2,[first])
state(X2,[object])
species(X2,[thing])
đọc(X1,X2)
named(X3,[chí])
position(X3,[second])
state(X3,[object])
species(X3,[human])
thấy(X1,X3)
báo(X4)
position(X4,[third])
state(X4,[object])
species(X4,[thing])
đọc(X3,X4)
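To show how such a structure represents the meaning of the paragraph, the DRS can be written as a Prolog term and queried. This is a minimal sketch, assuming the drs(Universe, Conditions) encoding, lower-case referent names x1 to x4, and the thing condition for “sách” attached to its own referent x2; example_drs/1 and reads/2 are hypothetical names.

```prolog
% Sketch only: the DRS of the example paragraph as a Prolog term
% drs(Universe, Conditions). Referent names x1..x4 and the predicate
% names below are illustrative.

:- use_module(library(lists)).

example_drs(drs([x1, x2, x3, x4],
    [ named(x1, ['lễ']), position(x1, [first]), state(x1, [subject]),
      species(x1, [human]),
      'sách'(x2), position(x2, [first]), state(x2, [object]),
      species(x2, [thing]), 'đọc'(x1, x2),
      named(x3, ['chí']), position(x3, [second]), state(x3, [object]),
      species(x3, [human]), 'thấy'(x1, x3),
      'báo'(x4), position(x4, [third]), state(x4, [object]),
      species(x4, [thing]), 'đọc'(x3, x4) ])).

% reads(-Name, -Thing): a named referent is the first argument of "đọc" and
% Thing is the one-place condition that holds of the second argument.
reads(Name, Thing) :-
    example_drs(drs(_, Con)),
    member('đọc'(Subj, Obj), Con),
    member(named(Subj, [Name]), Con),
    member(Cond, Con),
    Cond =.. [Thing, Obj].
```

The query reads(Name, Thing) answers first with Lễ and “sách”, then with Chí and “báo”; the second answer exists only because “anh ta” was resolved to the referent of Chí.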
5. DEVELOPMENT AND EVALUATION
We tested the system on 123 Vietnamese paragraphs that satisfy the characteristics described in
Section 3. The system builds the DRS structure and determines the exact antecedent of the
anaphoric pronoun in 86 paragraphs, a success rate of about 70%. Analyzing the results, we see
that, with the strategy and model proposed above, the paragraphs that are correctly resolved have
the following characteristics:
• There is only one human object that takes the subject role of the verb and appears in the
sentence preceding the sentence that contains an anaphoric pronoun standing alone.
• There is only one human object that takes the object role of the verb and appears in the
sentence preceding the sentence that contains an anaphoric pronoun standing with “ấy” /
“ta” / “này”.
• The pronouns in the paragraph are defined in the system.
The Vietnamese paragraphs that the system does not resolve successfully fall into the following
cases:
• No object appears before the anaphoric pronoun. In these paragraphs, the anaphoric
pronoun can appear at the head of the first sentence, so there is no antecedent for it.
• More than one object takes the same subject or object role in the sentence preceding the
one that contains the anaphoric pronoun. In this case, for one pronoun standing alone or
standing with “ấy” / “ta” / “này”, there can be more than one candidate object with the
same subject or object role of the verb in the previous sentences. The current strategy
cannot determine exactly which candidate is the antecedent of the anaphoric pronoun.
• The pronoun “nó” (English: it) occurs in the paragraph. In Vietnamese, the pronoun “nó”
can indicate a human object or a thing (animal or non-animate object), depending on the
context of the paragraph. In this research, we consider only nouns and pronouns that
indicate human objects, so the pronoun “nó” cannot be resolved.
The analysis shows that the proposed strategy and model perform successfully for the majority
of the paragraphs.
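As a quick check of the reported rate, using only the counts given in this section (the split into resolved and unresolved paragraphs is simple arithmetic, not an additional result):

```latex
\frac{86}{123} \approx 0.699 \approx 70\%, \qquad
123 - 86 = 37 \ \text{unresolved paragraphs}, \qquad
\frac{37}{123} \approx 0.301 \approx 30\%
```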
6. CONCLUSION AND FUTURE WORK
We presented a strategy for resolving inter-sentential anaphoric pronouns in Vietnamese
paragraphs composed of simple sentences. The strategy is based on the syntactic and semantic
characteristics of nouns in Vietnamese and on the focus phenomenon, in which the pronoun focuses
on the noun that takes the subject or object role of the verb, depending on whether the anaphoric
pronoun stands alone or stands with “ấy” / “ta” / “này”. With this strategy, we built the system
model with two algorithms implemented in Prolog, applying the model of [7], [9] with some
differences and improvements in accordance with the strategy, as follows:
• Only inter-sentential anaphoric pronouns are considered, with the antecedent appearing in a
sentence that precedes the sentence containing the pronoun.
• The paragraph is not analysed into sentences recursively; instead, the position of each
sentence is determined.
• There is no determiner in Vietnamese, so we remove the semantic role of the determiner.
• The characteristics of words are described according to Vietnamese grammar.
• The algorithms that find the antecedent of the anaphoric pronoun are based on constraints
over the grammatical characteristics of the noun: the position of the noun in the paragraph,
the characteristic indicating whether it is a human object or a thing (animal or non-animate
object), and the subject or object role of the verb.
• The focus on the antecedent depends on whether the anaphoric pronoun stands alone or
stands with “ấy” / “ta” / “này”.
The proposed strategy and model perform successfully for the majority of the Vietnamese
paragraphs tested. The main reason is that the approach is based on the grammar of words in
Vietnamese, which helps to analyze the characteristics of the paragraph exactly and to propose
an appropriate treatment strategy. The analysis of the experiment also shows that there are
paragraphs in which this strategy and model cannot resolve the inter-sentential anaphoric
pronoun. These cases require a deeper understanding of Vietnamese linguistic theory before they
can be analysed exactly. In future work, we will continue to follow the current approach and
further study the grammatical characteristics of lexical items in Vietnamese paragraphs, so that
the inter-sentential anaphoric pronoun can be resolved in the following cases:
• Things (animals or non-animate objects) are considered, not only human objects. This
requires careful research into resolving the pronoun “nó” in Vietnamese paragraphs.
• Paragraphs in which more than one object takes the same subject or object role of the verb
in the sentence preceding the sentence that contains the anaphoric pronoun.
REFERENCES
[1] Aravind K. Joshi & Scott Weinstein, (1981) “Control of inference: Role of some aspects of
discourse structure centering”, Proceedings of International Joint Conference on Artificial
Intelligence (IJCAI’1981), pp385–387.
[2] Barbara J. Grosz, Aravind K. Joshi & Scott Weinstein, (1995) “Centering: a framework for
modeling the local coherence of discourse”, Computational Linguistics, Vol. 21, No. 2, pp203–
225.
[3] Francis Cornish, (2009) “Inter-sentential anaphora and coherence relations in discourse: a perfect
match”, Language Sciences, Vol. 31, No. 5, pp572–592.
[4] Hans Kamp, (1981) “A theory of truth and semantic representation”, Formal methods in the study
of language, pp277–322, University of Amsterdam.
[5] Jaime G. Carbonell & Ralf D. Brown, (1988) “Anaphora Resolution: A Multi-Strategy
Approach”, Proceedings of the 12th International Conference on Computational Linguistics,
pp96-101.
[6] José Abraços & José Gabriel Lopes, (1994) “Extending DRT with a focussing mechanism for
pronominal anaphora and ellipsis resolution”, Proceedings of the 15th International Conference
on Computational Linguistics (COLING'94), pp1128-1132, Kyoto, Japan.
[7] Mark Johnson & Ewan Klein, (1986) “Discourse, anaphora and parsing”, Report No. CSLI-86-63,
Center for the Study of Language and Information, Stanford University, USA. Also in Proceedings
of Coling86, pp669-675.
[8] Michael A. Covington, (2007) “GULP 4: An Extension of Prolog for Unification Based
Grammar”, Research Report AI-1994-06, Artificial Intelligence Center, The University of
Georgia, USA.
[9] Michael A. Covington & Nora Schmitz, (1989) “An Implementation of Discourse Representation
Theory”, Advanced Computational Methods Center, The University of Georgia, USA.
[10] Michael A. Covington, Donald Nute, Nora Schmitz & David Goodman, (1988) “From English to
Prolog via Discourse Representation Theory”, ACMC Research Report 01-0024, The University of
Georgia, USA.
[11] Michel Denber, (1998) “Automatic Resolution of Anaphora in English”, Technical report,
Eastman Kodak Co.
[12] Patrick Blackburn & Johan Bos, (1999) “Representation and Inference for Natural Language -
Volume II: Working with Discourse Representation Structures”, Department of Computational
Linguistics, University of Saarland, Germany.
[13] Stuart M. Shieber, (2003) An introduction to unification-based approaches to grammar,
Microtome Publishing Brookline, Massachusetts.
[14] Tyne Liang & Dian-Song Wu, (2004) “Automatic Pronominal Anaphora Resolution in English
Texts”, Computational Linguistics and Chinese Language Processing, Vol. 9, No.1, pp21-40.