This document summarizes an investigation into improving the performance of a sampling-based alignment method for statistical machine translation. It makes two contributions: 1) a method that enforces the alignment of n-grams in distinct translation subtables to increase the number of longer n-grams, and 2) an examination of combining phrase translation tables produced by the sampling method and by MGIZA++, which shows that the combination slightly outperforms MGIZA++ alone and helps reduce out-of-vocabulary words. The method divides the parallel corpus into "unigramized" source-target n-gram subtables, runs the sampling aligner on each, and merges the resulting phrase tables.
AN INVESTIGATION OF THE SAMPLING-BASED ALIGNMENT METHOD AND ITS CONTRIBUTIONS
International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 4, No. 4, July 2013
DOI: 10.5121/ijaia.2013.4402
Juan Luo¹ and Yves Lepage²
¹,² Graduate School of Information, Production and Systems, Waseda University
2-7 Hibikino, Wakamatsu-ku, Fukuoka 808-0135, Japan
¹ juan.luo@suou.waseda.jp, ² yves.lepage@waseda.jp
ABSTRACT
By investigating the distribution of phrase pairs in phrase translation tables, the work in this paper
describes an approach to increase the number of n-gram alignments in phrase translation tables output
by a sampling-based alignment method. This approach consists in enforcing the alignment of n-grams
in distinct translation subtables so as to increase the number of n-grams. A standard normal distribution
is used to allot alignment time among translation subtables, which adjusts the distribution of n-grams.
This leads to better evaluation results on statistical machine translation tasks than the original
sampling-based alignment approach. Furthermore, the translation quality obtained by merging phrase
translation tables computed from the sampling-based alignment method and from MGIZA++ is examined.
KEYWORDS
Alignment, Phrase Translation Table, Statistical Machine Translation Task.
1. INTRODUCTION
Sub-sentential alignment plays an important role in the process of building a machine translation
system. The quality of the sub-sentential alignment, which identifies the relations between words
or phrases in the source language and those in the target language, is crucial for the final results
and the quality of a machine translation system. Currently, the most widely used state-of-the-art
alignment tool is GIZA++ [1], which belongs to the estimating trend. It trains the ubiquitous IBM
models [2] and the HMM introduced by [3]. MGIZA++ is a multi-threaded word aligner based on
GIZA++, originally proposed by [4].
In this paper, we focus on investigating a different alignment approach to the production of
phrase translation tables: the sampling-based approach [5]. There are two contributions of this
paper:
● Firstly, we propose a method to improve the performance of this sampling-based alignment
approach;
● Secondly, although evaluation results show that it lags behind MGIZA++, we show that, in
combination with the state-of-the-art method, it slightly outperforms MGIZA++ alone and
helps significantly to reduce the number of out-of-vocabulary words.
The paper is organized as follows. Section 2 presents related work. In Section 3, we briefly
review the technique of sampling-based alignment method. In Section 4, we propose a variant in
order to improve its performance. We also introduce a standard normal distribution of time to bias
the distribution of n-grams in phrase translation tables. Section 5 presents results obtained by
merging two aligners' phrase translation tables. Finally, in Section 6, conclusions and possible
directions for future work are presented.
2. RELATED WORK
There are various methods and models being suggested and implemented to solve the problem of
alignment. One can identify two trends to solve this problem [6]. On one side, there is the
associative alignment trend, which is illustrated by [7, 8, 9]. On the other side, the estimating
trend is illustrated by [1, 2, 10].
The associative alignment method employs similarity measures and association tests. These measures
are meant to rank word pairs and determine whether they are strongly associated with each other. In [7], Gale
and Church propose to use measures of association to find correspondences between words. They
introduce the Φ² coefficient, based on a two-by-two contingency table. Melamed [8] shows that
most source words tend to correspond to only one target word and presents methods for biasing
statistical translation models, which has a positive impact on identifying translational
equivalence. In [9], Moore proposes the log-likelihood-ratio association measure and an alignment
algorithm, which is faster and simpler than the generative probabilistic framework. The
estimating alignment approach employs statistical models whose parameters are estimated
through a maximization process. In [1, 2], a set of word alignment models are introduced and
phrase alignments are extracted given these word alignments. Liang et al. [10] propose a
symmetric alignment, which trains two asymmetric models jointly to maximize agreement
between the models.
3. SAMPLING-BASED ALIGNMENT METHOD
The sampling-based approach is implemented in a free open-source tool called Anymalign
(http://anymalign.limsi.fr/). It is in line with the associative alignment trend and it is much
simpler than the models implemented in MGIZA++. The sampling-based alignment approach
takes as input a sentence-aligned corpus and outputs, in a single step, pairs of sub-sentential
sequences similar to those in phrase translation tables. The approach exploits low frequency terms and
relies on distribution similarities to extract sub-sentential alignments. In addition, it has been
shown in [11] that the sampling-based method, i.e., Anymalign, requires less memory in
comparison with GIZA++. As a last and remarkable feature, it is capable of aligning multiple
languages simultaneously [5], but we will not use this feature in this paper as we restrict
ourselves to bilingual experiments.
In sampling-based alignment, low frequency terms and distribution similarities lay the foundation
for sub-sentential alignment. Low frequency terms, especially hapaxes, have been shown to
safely align across languages [12]. Hapaxes are words that occur only once in the input corpus. It
has been observed that the translational equivalence between hapaxes, which co-occur together in
a parallel sentence, is highly reliable. Aligned hapaxes have exactly the same trivial distribution
on lines (Here, “line” denotes a (source, target) sentence pair in a parallel corpus): 0 everywhere,
except 1 on the unique line they appear in. At the other end of the frequency spectrum, full stops
at the end of each sentence in both source and target languages have the same trivial distribution
on lines if one line contains one sentence: 1 everywhere. Building on these observations, and
exploiting the possibility of sampling many subcorpora from a corpus, only those sequences of
words sharing the exact same distribution (i.e., appearing in exactly the same sentences of the
corpus) are considered for alignment.
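To make this criterion concrete, here is a minimal Python sketch (an illustration only, not Anymalign's actual code): it maps every word of each side of a parallel (sub)corpus to the set of line indices on which it occurs, then pairs up the source and target word groups whose distributions coincide exactly.

    from collections import defaultdict

    def words_by_distribution(sentences):
        # map each word to the set of line indices it occurs on ...
        lines_of = defaultdict(set)
        for i, sentence in enumerate(sentences):
            for word in sentence.split():
                lines_of[word].add(i)
        # ... then group together words sharing exactly that distribution
        groups = defaultdict(list)
        for word, lines in lines_of.items():
            groups[frozenset(lines)].append(word)
        return groups

    def matching_distributions(src_sentences, tgt_sentences):
        # yield (source words, target words) with identical distributions,
        # e.g. two hapaxes co-occurring on a single line (0 everywhere,
        # 1 on that line), or full stops occurring on every line
        src = words_by_distribution(src_sentences)
        tgt = words_by_distribution(tgt_sentences)
        for lines in src.keys() & tgt.keys():
            yield src[lines], tgt[lines]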
The key idea is to make more words share the same distribution by artificially reducing their
frequency in multiple random subcorpora obtained by sampling. Here, the distribution of a word
is its pattern of occurrence across the parallel sentences. Indeed, the smaller a
subcorpus, the less frequent its words, and the more likely they are to share the same distribution;
hence the higher the proportion of words aligned in this subcorpus.
The subcorpus selection process is guided by a probability distribution which ensures a proper
coverage of the input parallel corpus:
p(k) = −log(1/n) / (k(k − 1))          (to be normalized)          (1)
where k denotes the size (number of sentences) of a subcorpus and n the size of the complete
input corpus. Note that this function is very close to 1/k²
: it gives much more credit to small
subcorpora, which happen to be the most productive [5]. Once the size of a subcorpus has been
chosen according to this distribution, its sentences are randomly selected from the complete input
corpus according to a uniform distribution. Then, from each subcorpus, sequences of words that
share the same distribution are extracted to constitute alignments along with the number of times
they were aligned (contrary to the widely used terminology where it denotes a set of links
between the source and target words of a sentence pair, we call “alignment” a (source, target)
phrase pair, i.e., it corresponds to an entry in the so-called phrase translation tables). Eventually,
the list of alignments is turned into a full-fledged phrase translation table, by calculating various
features for each alignment. In the following, we use two translation probabilities and two lexical
weights as proposed by [13], as well as the commonly used phrase penalty, for a total of five
features.
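The anytime sampling loop described above can be sketched as follows. This is a hedged illustration, not Anymalign's implementation: the subcorpus-size weights only follow the shape of the distribution in equation (1) (the constant factor disappears in the normalization performed by random.choices), and the extraction step is passed in as a callable, for instance the distribution-matching criterion sketched earlier.

    import random
    from collections import Counter

    def align_by_sampling(corpus, num_subcorpora, extract_alignments):
        # corpus: list of (source, target) sentence pairs;
        # extract_alignments: callable returning (source, target) phrase
        # pairs for one subcorpus (stand-in for Anymalign's extraction)
        n = len(corpus)
        sizes = range(2, n + 1)
        # weights proportional to 1/(k(k-1)), i.e. very close to 1/k^2:
        # small subcorpora, the most productive ones, are favored
        weights = [1.0 / (k * (k - 1)) for k in sizes]
        counts = Counter()
        for _ in range(num_subcorpora):  # interruptible at any moment
            k = random.choices(sizes, weights=weights)[0]
            subcorpus = random.sample(corpus, k)  # uniform selection
            counts.update(extract_alignments(subcorpus))
        # the counts back the translation probabilities: the longer the
        # loop runs, the more reliable their estimation
        return counts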
One important characteristic of the sampling-based alignment method is that it is implemented
with an anytime algorithm: the number of random subcorpora to be processed is not set in
advance, so the alignment process can be interrupted at any moment. Contrary to many
approaches, after a very short amount of time, quality is no longer a matter of time, but
quantity is: the longer the aligner runs (i.e., the more subcorpora processed), the more alignments
are produced, and the more reliable their associated translation probabilities, as they are calculated
on the basis of the number of times each alignment was obtained. This is possible because high
frequency alignments are quickly output with a fairly good estimation of their translation
probabilities. As time goes by, their estimation is refined, while less frequent alignments are output
in addition.
Intuitively, since the sampling-based alignment process can be interrupted without sacrificing the
quality of alignments, it should be possible to allot more processing time for n-grams of similar
lengths in both languages and less time to very different lengths. For instance, a source bigram is
much less likely to be aligned with a target 9-gram than with a bigram or a trigram. The
experiments reported in this paper make use of the anytime feature of Anymalign and of the
possibility of allotting time freely.
3.1. Preliminary Experiment
In order to measure the performance of the sampling-based alignment approach implemented in
Anymalign in statistical machine translation tasks, we conducted a preliminary experiment and
compared with the standard alignment setting: symmetric alignments obtained from MGIZA++.
Although Anymalign and MGIZA++ are both capable of parallel processing, for a fair comparison
in time, we ran them as single processes in all our experiments.
3.1.1. Experimental Setup
A sample of the French-English parts of the Europarl parallel corpus [14] was used for training,
tuning and testing. A detailed description of the data used in the experiments is given in Table 1.
The training corpus is made of 100k sentences. The development set contains 500 sentences, and
1,000 sentences were used for testing. To perform the experiments, a standard statistical machine
translation system was built for each different alignment setting, using the Moses decoder [15],
MERT (Minimum Error Rate Training) to tune the parameters of translation tables [16], and the
SRI Language Modeling toolkit [17] to build the target language model. As for the evaluation of
translations, four standard automatic evaluation metrics were used: WER [18], BLEU [19], NIST
[20], and TER [21].
Table 1. Statistics on the French-English parallel corpus used for the training, development, and test sets.

                          French      English
Train   sentences         100,000     100,000
        words             3,986,438   2,824,579
        words/sentence    38          27
Dev     sentences         500         500
        words             18,120      13,261
        words/sentence    36          26
Test    sentences         1,000       1,000
        words             38,936      27,965
        words/sentence    37          27
3.1.2. Problem Definition
In a first setting, we evaluated the quality of translations output by the Moses decoder using the
phrase translation table obtained by making MGIZA++'s alignments symmetric. In a second
setting, this phrase translation table was simply replaced by that produced by Anymalign. Since
Anymalign can be stopped at any time, for a fair comparison, it was run for the same amount of
time as MGIZA++: seven hours in total. The experimental results are shown in Table 2.
Table 2. Evaluation results on a statistical machine translation task using phrase tables obtained from
MGIZA++ and Anymalign (baseline).
BLEU NIST WER TER
MGIZA++ 0.2742 6.6747 0.5714 0.6170
Anymalign 0.2285 6.0764 0.6186 0.6634
In order to investigate the differences between MGIZA++ and Anymalign phrase translation
tables, we analyzed the distribution of n-grams of both aligners. The distributions are shown in
Table 6(a) and Table 6(b). Anymalign's phrase translation table contains 8 times as many 1×1
alignments as MGIZA++'s table, and twice as many 1×2 and 2×1 alignments. Along the diagonal
(m×m n-grams), however, Anymalign's table contains more than 10 times fewer alignments than
MGIZA++'s table. This confirms
the results given in [22] that the sampling-based approach excels in aligning unigrams, which
makes it better at multilingual lexicon induction than, e.g., MGIZA++. However, its phrase
translation tables do not reach the performance of symmetric alignments from MGIZA++ on
translation tasks. This basically comes from the fact that Anymalign does not align enough long
n-grams [22]. Longer n-grams are essential in a phrase-based machine translation system as they
contribute to the fluency of translations.
4. DIVIDING INTO PHRASE TRANSLATION SUBTABLES
4.1. Enforcing Alignment of N-grams
To solve the above-mentioned problem, we propose a method to force the sampling-based
approach to align more n-grams.
Consider that we have a parallel input corpus, i.e., a list of (source, target) sentence pairs, for
instance, in French and English. Groups of characters that are separated by spaces in these
sentences are considered as words. Single words are referred to as unigrams, and sequences of
two and three words are called bigrams and trigrams, respectively. Theoretically, since the
sampling-based alignment method excels at aligning unigrams, we could improve it by making it
align bigrams, trigrams, or even longer n-grams as if they were unigrams. We do this by replacing
the spaces between the words of an n-gram with underscores and reduplicating words as many times
as needed, which makes bigrams, trigrams, and longer n-grams appear as unigrams. Table 3 depicts
this transformation of n-grams into unigrams.
Table 3. Transforming n-grams into unigrams by inserting underscores and reduplicating words for both the
French part and English part of the input parallel corpus.
n  French                                       English
1  le débat est clos .                          the debate is closed .
2  le_débat débat_est est_clos clos_.           the_debate debate_is is_closed closed_.
3  le_débat_est débat_est_clos est_clos_.       the_debate_is debate_is_closed is_closed_.
4  le_débat_est_clos débat_est_clos_.           the_debate_is_closed debate_is_closed_.
5  le_débat_est_clos_.                          the_debate_is_closed_.
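The transformation illustrated in Table 3 amounts to sliding a window of n words over each sentence and joining the words inside the window with underscores. A minimal sketch (unigramize is a hypothetical helper, not part of Anymalign):

    def unigramize(sentence, n):
        # turn each n-gram of a tokenized sentence into a single token
        # by joining the words of a sliding window with underscores
        words = sentence.split()
        return " ".join("_".join(words[i:i + n])
                        for i in range(len(words) - n + 1))

    # reproduces Table 3 on the English side, e.g.:
    # unigramize("the debate is closed .", 2)
    #   -> "the_debate debate_is is_closed closed_."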
Similar work on the idea of enlarging n-grams has been reported in [23], in which "word
packing" is used to obtain 1-to-n alignments based on co-occurrence frequencies, and in [24], in
which collocation segmentation is performed on a bilingual corpus to extract n-to-m alignments.
4.2. Phrase Translation Subtables
It is thus possible to use various parallel corpora, with different segmentation schemes in the
source and target parts. We refer to a parallel corpus where source n-grams and target m-grams
are assimilated to unigrams as a unigramized n-m corpus. These corpora are then used as input
to Anymalign to produce phrase translation subtables, as shown in Table 4. Practically, we call
Anymalign1-N the process of running Anymalign with all possible unigramized n-m corpora,
with n and m both ranging from 1 to a given N. In total, Anymalign is thus run N×N times. All
phrase translation subtables are finally merged together into one large translation table, where
translation probabilities are re-estimated given the complete set of alignments.
Table 4. List of n-gram translation subtables (TT) generated from the training corpus. These subtables are
then merged together into a single phrase translation table.
Source \ Target   1-grams   2-grams   3-grams   …   N-grams
1-grams           TT1×1     TT1×2     TT1×3     …   TT1×N
2-grams           TT2×1     TT2×2     TT2×3     …   TT2×N
3-grams           TT3×1     TT3×2     TT3×3     …   TT3×N
…                 …         …         …         …   …
N-grams           TTN×1     TTN×2     TTN×3     …   TTN×N
Although Anymalign is capable of directly producing alignments of sequences of words, we use it
with a simple filter (option -N 1 in the program), so that it only produces (typographic) unigrams
in output, i.e., n-grams and m-grams assimilated to unigrams in the input corpus. This choice was
made because producing alignments of sequences of words is useless here: the phrases we are
interested in for the subsequent machine translation tasks are already contained in our
(typographic) unigrams, and all we need to do to recover the original segmentation is to
replace the underscores in the alignments with spaces.
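The whole Anymalign1-N process can thus be sketched as the following driver, reusing the unigramize helper above; run_aligner stands for a hypothetical wrapper around an Anymalign run with the -N 1 filter, returning (source, target, count) tuples of typographic unigrams.

    def anymalign_1_N(src_sentences, tgt_sentences, N, run_aligner):
        # run the aligner on every unigramized n-m corpus (N x N runs)
        # and merge the resulting subtables into one list of phrase pairs
        merged = []
        for n in range(1, N + 1):
            for m in range(1, N + 1):
                src = [unigramize(s, n) for s in src_sentences]
                tgt = [unigramize(t, m) for t in tgt_sentences]
                for source, target, count in run_aligner(src, tgt):
                    # recover the original segmentation by turning the
                    # underscores back into spaces
                    merged.append((source.replace("_", " "),
                                   target.replace("_", " "),
                                   count))
        # translation probabilities are then re-estimated on this set
        return merged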
4.3. Equal Time Configuration
The same experimental process as in the preliminary experiment (i.e., replacing the translation
table) was carried out on Anymalign1-N with an equal time configuration, that is, with the
alignment time uniformly distributed among the subtables. For a fair comparison, the same amount
of time was given: seven hours in total. The results are shown in Table 7. On the whole, MGIZA++ significantly
outperforms Anymalign, by more than 4 BLEU points. The proposed approach (Anymalign1-N)
produces better results than Anymalign in its basic version, with the best increase with
Anymalign1-3 or Anymalign1-4 (+1.3 BLEU points).
The comparison of Table 6(a) and Table 6(c) shows that Anymalign1-N delivers too many
alignments outside of the diagonal (m×m n-grams) and still not enough along the diagonal.
Consequently, this number of alignments should be lowered. A way of doing so is by giving less
time for alignments outside of the diagonal.
4.4. Standard Normal Time Distribution
In order to increase the number of phrase pairs along the diagonal of the translation table matrix
and decrease this number outside the diagonal (Table 4), we distribute the total alignment time
among translation subtables according to the standard normal distribution as it is the most natural
distribution intuitively fitting the distribution observed in Table 6(a).
φ(n, m) = (1/√(2π)) e^(−(n−m)²/2)          (2)
The alignment time allotted to the subtable between source n-grams and target m-grams will thus
be proportional to φ(n,m). Table 5 shows an example of alignment times allotted to each subtable
up to 4-grams, for a total processing time of 7 hours.
Table 5. Alignment time in seconds allotted to each unigramized parallel corpus of Anymalign1-4. The sum
of the figures in all cells amounts to seven hours (7 hrs = 25,200 seconds).
Source \ Target   1-grams   2-grams   3-grams   4-grams
1-grams           3,072     1,863     416       34
2-grams           1,863     3,072     1,863     416
3-grams           416       1,863     3,072     1,863
4-grams           34        416       1,863     3,072
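The figures of Table 5 follow directly from scaling φ(n, m) so that the N×N cells sum to the total time budget; the constant 1/√(2π) cancels out in this normalization. A short sketch reproducing them:

    import math

    def allot_time(N, total_seconds):
        # distribute the time budget over the N x N subtables in
        # proportion to phi(n, m) = exp(-(n - m)**2 / 2)
        weight = {(n, m): math.exp(-(n - m) ** 2 / 2.0)
                  for n in range(1, N + 1) for m in range(1, N + 1)}
        scale = total_seconds / sum(weight.values())
        return {cell: w * scale for cell, w in weight.items()}

    # allot_time(4, 25200) yields about 3,072 s on the diagonal, 1,863 s
    # one step off, 416 s two steps off and 34 s three steps off,
    # matching Table 5.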
We performed a third evaluation using the standard normal distribution of time, again with the
same total amount of processing time as in the previous experiments (7 hours).
The comparison between MGIZA++, Anymalign in its standard use (baseline), and Anymalign1-
N with standard normal time distribution is shown in Table 7. Anymalign1-4 shows the best
performance in terms of BLEU and WER scores, while Anymalign1-3 gets the best results for the
two other evaluation metrics. There is an increase in BLEU scores for almost all Anymalign1-N
configurations compared with the equal time configuration.
Table 7. Evaluation results (MGIZA++, the original Anymalign (baseline), and Anymalign1-N).

                 BLEU     NIST     WER      TER
MGIZA++          0.2742   6.6747   0.5714   0.6170
Anymalign        0.2285   6.0764   0.6186   0.6634

                 equal time configuration             std. norm. time distribution
                 BLEU     NIST     WER      TER       BLEU     NIST     WER      TER
Anymalign1-10    0.2182   5.8534   0.6475   0.6886    0.2361   6.1803   0.6192   0.6587
Anymalign1-9     0.2296   6.0261   0.6279   0.6722    0.2402   6.1928   0.6136   0.6564
Anymalign1-8     0.2253   5.9777   0.6353   0.6794    0.2366   6.1639   0.6151   0.6597
Anymalign1-7     0.2371   6.2107   0.6157   0.6559    0.2405   6.2124   0.6136   0.6564
Anymalign1-6     0.2349   6.1574   0.6193   0.6634    0.2403   6.1595   0.6165   0.6589
Anymalign1-5     0.2376   6.2331   0.6099   0.6551    0.2436   6.2426   0.6134   0.6548
Anymalign1-4     0.2423   6.2087   0.6142   0.6583    0.2442   6.2844   0.6071   0.6526
Anymalign1-3     0.2403   6.3009   0.6075   0.6507    0.2441   6.2928   0.6079   0.6517
Anymalign1-2     0.2406   6.2789   0.6121   0.6536    0.2404   6.2674   0.6121   0.6535
Anymalign1-1     0.1984   5.6353   0.6818   0.7188    0.1984   5.6353   0.6818   0.7188
Again, we investigated the number of entries in Anymalign1-N run with this standard normal
time distribution. Table 6 compares the number of entries for Anymalign1-4 under (c) the equal time
configuration and (d) the standard normal time distribution. The number of phrase pairs on the
diagonal roughly doubled when using standard normal time distribution. We can see a significant
increase in the number of phrase pairs of similar lengths, while the number of phrase pairs with
different lengths tends to decrease slightly. This means that the standard normal time distribution
allowed us to produce many more useful alignments (a priori, phrase pairs with similar
lengths), while keeping the noise (phrase pairs with different lengths) at a low level, which is
a clear advantage over the original method.
5. MERGING PHRASE TRANSLATION TABLES
In order to check exactly how different the phrase translation table of MGIZA++ and that of
Anymalign are, we performed a fourth set of experiments in which MGIZA++'s translation table
is merged with that of the Anymalign baseline, i.e., we used the union of the two phrase translation
tables. For the intersection of the two aligners' tables, i.e., entries that share the same phrase
pair in both translation tables but carry different feature scores, we adopted for evaluation the
parameters computed either by MGIZA++ or by Anymalign.
In addition, we used the backoff model feature in Moses. This feature allows the use of two
phrase translation tables in the process of decoding. The second phrase translation table is used as
a backoff for unknown words (i.e., words that cannot be found in the first phrase translation table).
To examine the quality of the 1-grams produced by Anymalign and how they can benefit a machine
translation system, we used MGIZA++ as the first table and the Anymalign baseline as the backoff
table for unknown words in these experiments. We also experimented with limiting the length of
the n-grams used from the backoff table.
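Under the simplifying assumption that a phrase table is held in memory as a dictionary from (source, target) phrase pairs to feature scores (the actual tables are Moses-format files), the union merge can be sketched as follows; which aligner's scores win on the intersection is the parameter varied in Table 8.

    def merge_phrase_tables(primary, secondary):
        # union of two phrase tables; entries present in both keep the
        # feature scores of `primary` (e.g. MGIZA++'s scores)
        merged = dict(secondary)
        merged.update(primary)  # primary overrides the intersection
        return merged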
Evaluation results on machine translation tasks with merged translation tables are given in Table
8. This setting outperforms MGIZA++ on BLEU, as well as on the three other evaluation metrics.
The phrase translation table with Anymalign parameters for the intersection part is slightly behind
the phrase translation table with MGIZA++ parameters. This may indicate that the feature scores
in Anymalign's phrase translation table need to be revised. In Anymalign, the frequency counts of
phrase pairs are collected from the subcorpora. A possible revision of the computation of feature
scores would be to count the occurrences of phrase pairs over the whole corpus.
Table 8. Evaluation results (MGIZA++, the original Anymalign (baseline), merged translation tables, and
backoff models).
BLEU NIST WER TER
MGIZA++ 0.2742 6.6747 0.5714 0.6170
Anymalign 0.2285 6.0764 0.6186 0.6634
Merge (Anymalign param.) 0.2747 6.7101 0.5671 0.6128
Merge (MGIZA++ param.) 0.2754 6.7060 0.5685 0.6142
Backoff model (1-grams) 0.2809 6.7546 0.5634 0.6080
Backoff model (2-grams) 0.2809 6.7546 0.5634 0.6080
Backoff model (3-grams) 0.2804 6.7546 0.5634 0.6081
Backoff model (4-grams) 0.2805 6.7547 0.5634 0.6082
Backoff model (5-grams) 0.2804 6.7546 0.5633 0.6081
Backoff model (6-grams) 0.2804 6.7546 0.5633 0.6081
Backoff model (7-grams) 0.2804 6.7546 0.5633 0.6081
Evaluation results on using backoff models show that unigrams produced by Anymalign help in
reducing the number of unknown words and thus contribute to the increase in BLEU scores. For
further analysis, the number of unique n-grams in the (French) test set that can be found in the
phrase translation tables is shown in Table 9. Anymalign gives greater lexical (1-gram) coverage
than MGIZA++ and reduces the number of unknown words on the test corpus. There are 341
unique unigrams from the French test corpus that cannot be found in MGIZA++'s phrase
translation table. These unigrams are unknown words for the MGIZA++ table, and they are about
twice as numerous as the unknown words for the Anymalign phrase translation table (189). For n-grams (n ≥ 2),
Anymalign gives less coverage than MGIZA++. An analysis of overlaps and differences between
the two phrase translation tables is given in Table 10. Only about 7% of the phrase pairs produced
by Anymalign overlap with those of MGIZA++, which clearly shows that the two methods produce
different phrases.
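Counts such as those of Table 9 can be obtained with a short sketch of this kind (coverage is a hypothetical helper; phrase_table_sources stands for the collection of source-side phrases of a translation table):

    def coverage(test_sentences, phrase_table_sources, n):
        # unique n-grams of the test set, and how many of them appear
        # among the source phrases of the translation table
        ngrams = {tuple(words[i:i + n])
                  for sentence in test_sentences
                  for words in [sentence.split()]
                  for i in range(len(words) - n + 1)}
        sources = {tuple(p.split()) for p in phrase_table_sources}
        found = sum(1 for g in ngrams if g in sources)
        return len(ngrams), found, len(ngrams) - found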
Table 9. Number of unique n-grams in French test set found in phrase translation tables.

                       MGIZA++              Anymalign
n-grams   corpus   in TT    not in TT   in TT   not in TT
1-gram    3885     3544     341         3696    189
2-gram    15230    10492    4738        4959    10271
3-gram    22777    8987     13790       1179    21598
4-gram    25024    4418     20606       212     24812
5-gram    25184    1748     23436       46      25138
6-gram    24666    672      23994       12      24654
7-gram    23880    271      23609       5       23875
Table 10. Analysis of overlap between two phrase translation tables.

Aligner     Overlap   Difference   Total
MGIZA++     90,086    3,678,331    3,768,417
Anymalign   90,086    1,281,779    1,371,865
6. CONCLUSIONS AND FUTURE WORK
In this section, we summarize the work of this research and highlight its contributions. In addition,
we suggest possible directions for future work.
In this paper, by examining the distribution of n-grams in Anymalign phrase translation tables, we
presented a method to improve the translation quality of the sampling-based sub-sentential alignment
approach implemented in Anymalign: firstly, Anymalign was forced to align n-grams as if they
were unigrams; secondly, time was unevenly distributed over subtables; thirdly, merging of two
aligners' phrase translation tables was introduced. A baseline statistical machine translation
system was built to compare the translation performance of two aligners: MGIZA++ and
Anymalign. Anymalign1-N, the method presented here, obtains significantly better results than
the original method, the best performance being obtained with Anymalign1-4. Merging
Anymalign's phrase translation table with that of MGIZA++ allows outperforming MGIZA++
alone. The use of backoff models shows that Anymalign is good for reducing the number of
unknown words.
There are conflicting arguments as to which phrase lengths translation quality benefits from. [13]
suggested that phrases of up to three words contribute the most to BLEU scores, which was
confirmed for instance by [25, 26]. However, [27] argued that longer phrases should not be
neglected. As for Anymalign1-N, Anymalign1-3 and Anymalign1-4, in which phrase lengths are
limited to three and four words respectively, get the best results on the four evaluation metrics among
all variants of Anymalign. This would confirm that longer phrases can indeed be a source of
noise in the translation process, while shorter, more reliable phrases contribute a
lot to translation quality.
In addition, recent work by [28] shows that it is safe to discard a phrase if it can be decomposed
into shorter phrases. They note that discarding the phrase “the French government”, which is
compositional, does not change the translation cost, whereas the phrase “the government of
France” should be retained in the phrase translation table. This raises the question of what
phrases constitute a good phrase translation table. As for Anymalign, observing the
proportion of compositional and non-compositional phrases in its phrase translation table is left
for future work.
Furthermore, given the differences in evaluation results between using Anymalign's feature
scores and MGIZA++'s on the overlapping part of their respective phrase translation tables, we
wonder whether the feature scores computed by Anymalign should be modified in order to mimic
those of MGIZA++ and better suit the expectations of the Moses decoder. There is also the
question of whether the distribution of phrase pairs in MGIZA++'s translation table is ideal, and
whether a different distribution in Anymalign's translation tables could contribute to further
improvement. This is an important aspect for further research.
ACKNOWLEDGEMENTS
Part of the research presented in this paper has been done under a Japanese grant-in-aid (Kakenhi
C, 23500187: Improvement of alignments and release of multilingual syntactic patterns for
statistical and example-based machine translation).
REFERENCES
[1] F.J. Och and H. Ney, “A systematic comparison of various statistical alignment models,” Computational
Linguistics, vol.29, no.1, pp.19-51, 2003.
[2] P. Brown, S. Della Pietra, V. Della Pietra, and R. Mercer, “The mathematics of statistical machine
translation: Parameter estimation,” Computational Linguistics, vol.19, no.2, pp.263-311, 1993.
[3] S. Vogel, H. Ney, and C. Tillman, “HMM-based word alignment in statistical translation,” Proceedings of
16th International Conference on Computational Linguistics, Copenhagen, pp.836-841, 1996.
[4] Q. Gao and S. Vogel, “Parallel implementations of word alignment tool,” Software Engineering, Testing,
and Quality Assurance for Natural Language Processing, Columbus, Ohio, pp.49-57, 2008.
[5] A. Lardilleux and Y. Lepage, “Sampling-based multilingual alignment,” Proceedings of International
Conference on Recent Advances in Natural Language Processing, Borovets, Bulgaria, pp.214-218, 2009.
[6] A. Lardilleux, Y. Lepage, and F. Yvon, “The contribution of low frequencies to multilingual sub-sentential
alignment: a differential associative approach,” International Journal of Advanced Intelligence, vol.3, no.2,
pp.189-217, 2011.
[7] W. Gale and K. Church, “Identifying word correspondences in parallel texts,” Proceedings of 4th DARPA
workshop on Speech and Natural Language, Pacific Grove, pp.152-157, 1991.
[8] D. Melamed, “Models of translational equivalence among words,” Computational Linguistics, vol.26, no.2,
pp.221-249, 2000.
[9] R. Moore, “Association-based bilingual word alignment,” Proceedings of ACL Workshop on Building and
Using Parallel Texts, Ann Arbor, pp.1-8, 2005.
[10] P. Liang, B. Taskar, and D. Klein, “Alignment by agreement,” Proceedings of Human Language
Technology Conference of the NAACL, New York City, pp.104-111, 2006.
[11] A. Toral, M. Poch, P. Pecina, and G. Thurmair, “Efficiency-based evaluation of aligners for industrial
applications,” Proceedings of 16th Annual Conference of the European Association for Machine
Translation, pp.57-60, 2012.
[12] A. Lardilleux and Y. Lepage, “Hapax legomena: their contribution in number and efficiency to word
alignment,” Lecture notes in computer science, vol.5603, pp.440-450, 2009.
[13] P. Koehn, F.J. Och, and D. Marcu, “Statistical phrase-based translation,” Proceedings of 2003 Human
Language Technology Conference of the North American Chapter of the Association for Computational
Linguistics, Edmonton, pp.48-54, 2003.
[14] P. Koehn, “Europarl: A Parallel Corpus for Statistical Machine Translation,” Proceedings of 10th Machine
Translation Summit (MT Summit X), Phuket, pp.79-86, 2005.
[15] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran,
R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, “Moses: Open source toolkit for statistical
machine translation,” Proceedings of 45th Annual Meeting of the Association for Computational
Linguistics, Prague, Czech Republic, pp.177-180, 2007.
[16] F.J. Och, “Minimum error rate training in statistical machine translation,” Proceedings of 41st Annual
Meeting on Association for Computational Linguistics, Sapporo, Japan, pp.160-167, 2003.
[17] A. Stolcke, “SRILM-an extensible language modeling toolkit,” Proceedings of 7th International
Conference on Spoken Language Processing, Denver, Colorado, pp.901-904, 2002.
[18] S. Nießen, F.J. Och, G. Leusch, and H. Ney, “An evaluation tool for machine translation: Fast evaluation
for machine translation research,” Proceedings of 2nd International Conference on Language Resources
and Evaluation, Athens, pp.39-45, 2000.
[19] K. Papineni, S. Roukos, T. Ward, and W.J. Zhu, “BLEU: a method for automatic evaluation of machine
translation,” Proceedings of 40th Annual Meeting of the Association for Computational Linguistics,
Philadelphia, pp.311-318, 2002.
[20] G. Doddington, “Automatic evaluation of machine translation quality using N-gram co-occurrence
statistics,” Proceedings of 2nd International Conference on Human Language Technology Research, San
Diego, pp.138-145, 2002.
[21] M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul, “A study of translation edit rate with
targeted human annotation,” Proceedings of Association for Machine Translation in the Americas,
Cambridge, Massachusetts, pp.223-231, 2006.
[22] A. Lardilleux, J. Chevelu, Y. Lepage, G. Putois, and J. Gosme, “Lexicons or phrase tables? An
investigation in sampling-based multilingual alignment,” Proceedings of 3rd workshop on example-based
machine translation, Dublin, Ireland, pp.45-52, 2009.
[23] Y. Ma, N. Stroppa, and A. Way, “Bootstrapping word alignment via word packing,” Proceedings of 45th
Annual Meeting of the Association of Computational Linguistics, Prague, Czech Republic, pp.304-311,
2007.
[24] C.A. Henríquez Q., M.R. Costa-jussà, V. Daudaravicius, R.E. Banchs, and J.B. Mariño, “Using collocation
segmentation to augment the phrase table,” Proceedings of Joint Fifth Workshop on Statistical Machine
Translation and MetricsMATR, Uppsala, Sweden, pp.98-102, 2010.
[25] Y. Chen, M. Kay, and A. Eisele, “Intersecting multilingual data for faster and better statistical
translations,” Proceedings of Human Language Technologies: The 2009 Annual Conference of the North
American Chapter of the Association for Computational Linguistics, Boulder, Colorado, pp.128-136, 2009.
[26] G. Neubig, T. Watanabe, E. Sumita, S. Mori, and T. Kawahara, “An unsupervised model for joint phrase
alignment and extraction,” Proceedings of 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies, Portland, Oregon, USA, pp.632-641, 2011.
[27] C. Callison-Burch, C. Bannard, and J. Schroeder, “Scaling phrase based statistical machine translation to
larger corpora and longer phrases,” Proceedings of 43rd Annual Meeting on Association for Computational
Linguistics, Ann Arbor, Michigan, pp.255-262, 2005.
[28] R. Zens, D. Stanton, and P. Xu, “A systematic comparison of phrase table pruning techniques,” Proceedings
of Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural
Language Learning, Jeju Island, Korea, pp.972-983, 2012.